dlrover
Community
Enable resilient large-scale distributed training on Kubernetes.
Content & Communication
#kubernetes #checkpointing #distributed #training #fault-tolerance #elastic scaling
Author: dongg622
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill automates elastic distributed training workflows, adding fault tolerance, elastic scaling, and efficient checkpointing.
Core Features & Use Cases
- Fault-tolerant training: Automatically recover from node and process failures during large-scale model training.
- Elastic scaling: Dynamically adjust the number of worker nodes based on resource availability and training needs.
- Use Case: A data scientist needs to run a 1,000-GPU job that recovers from failures automatically and scales up or down seamlessly.
- FlashCheckpoint: Asynchronously save checkpoints during training to reduce downtime and enable quick recovery.
- Network and cluster health monitoring: Detect and troubleshoot network bottlenecks and environment issues automatically.
- Hyperparameter tuning and resource optimization: Use integrated features to optimize cluster utilization on Kubernetes.
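The asynchronous checkpointing idea in the feature list can be illustrated with a minimal, standard-library-only sketch. This is not DLRover's FlashCheckpoint API: the `AsyncCheckpointer` class, its method names, and the pickle file layout are all hypothetical, chosen only to show the pattern of copying state in memory and writing it to disk on a background thread so the training loop is not blocked by I/O.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper (not a DLRover class): snapshot state in memory,
    then persist it on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._thread = None

    def save(self, step, state):
        # Wait for any previous write so at most one write is in flight.
        self.wait()
        # Copy in the caller's thread so later updates do not leak into the snapshot.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(target=self._write, args=(step, snapshot))
        self._thread.start()

    def _write(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step:08d}.pkl")
        # Write to a temporary file and rename it, so a crash mid-write
        # does not leave a partial file under the final name.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"step": step, "state": snapshot}, f)
        os.replace(tmp_path, final_path)

    def wait(self):
        # Block until the in-flight write (if any) has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def load_latest(self):
        # Zero-padded step numbers make lexicographic order match step order.
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".pkl"))
        if not names:
            return None
        with open(os.path.join(self.directory, names[-1]), "rb") as f:
            return pickle.load(f)
```

In a training loop, `save(step, state)` would be called every few steps and `load_latest()` once at startup to resume after a failure. A real system would also need to handle accelerator memory and sharded state, which this sketch ignores.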
Quick Start
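Launch a fault-tolerant distributed training job by specifying the initial node count, the minimum and maximum node counts, and a checkpoint directory. For example, the following command starts with 4 nodes and allows scaling between 2 and 8 nodes (it does not set a checkpoint directory): "dlrover --num_nodes=4 --max_node=8 --min_node=2 python train.py".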
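When the launch command is generated from a script, it can help to validate the node bounds before submitting. The sketch below assembles the same command string as the Quick Start example. The `build_launch_command` function is hypothetical, and the flag names are copied from the example above rather than verified against the DLRover CLI, so treat them as assumptions.

```python
import shlex


def build_launch_command(num_nodes, min_node, max_node, script="train.py"):
    """Hypothetical helper: validate node bounds, then build the launch command.

    Flag names mirror the Quick Start example and are not verified
    against the actual DLRover command-line interface.
    """
    if not (1 <= min_node <= num_nodes <= max_node):
        raise ValueError(
            f"expected 1 <= min_node <= num_nodes <= max_node, "
            f"got min={min_node}, num={num_nodes}, max={max_node}"
        )
    args = [
        "dlrover",
        f"--num_nodes={num_nodes}",
        f"--max_node={max_node}",
        f"--min_node={min_node}",
        "python",
        script,
    ]
    # shlex.join quotes any argument that needs it (for example a path with spaces).
    return shlex.join(args)


if __name__ == "__main__":
    print(build_launch_command(4, 2, 8))
```

Checking the bounds up front catches configurations such as an initial node count below the minimum before the job is submitted.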
Dependency Matrix
Required Modules
dlrover
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: dlrover
Download link: https://github.com/dongg622/china-ai-chip-skill/archive/main.zip#dlrover
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.