dlrover

Community

Enable resilient large-scale distributed training on Kubernetes.

Author: dongg622
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill automates elastic distributed training workflows, adding fault tolerance, elastic scaling, and efficient checkpointing.

Core Features & Use Cases

  • Fault-tolerant training: Automatically recover from node and process failures during large-scale model training.
  • Elastic scaling: Dynamically adjust the number of worker nodes based on resource availability and training needs.
  • Use Case: A data scientist needs to run a 1,000-GPU job that automatically recovers from failures and scales up or down seamlessly.
  • FlashCheckpoint: Asynchronously save checkpoints during training to reduce downtime and enable quick recovery.
  • Network and cluster health monitoring: Detect and troubleshoot network bottlenecks and environment issues automatically.
  • Hyperparameter tuning and resource optimization: Use integrated features to optimize cluster utilization on Kubernetes.
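
The asynchronous checkpointing idea behind the FlashCheckpoint feature can be illustrated with a minimal sketch. This is not DLRover's API: the class name AsyncCheckpointer, its methods, and the pickle file layout are all hypothetical, chosen only to show the general pattern of taking a quick in-memory snapshot and writing it to disk on a background thread so the training loop is not blocked by I/O.

```python
import copy
import pickle
import threading
from pathlib import Path


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state in memory, persist it in the background."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self._writer = None

    def save(self, step, state):
        # Finish any in-flight write first so checkpoints land in order.
        self.wait()
        # In-memory copy: later changes to `state` do not affect the snapshot.
        snapshot = copy.deepcopy(state)
        self._writer = threading.Thread(target=self._write, args=(step, snapshot))
        self._writer.start()

    def _write(self, step, snapshot):
        tmp = self.directory / f"step_{step}.tmp"
        final = self.directory / f"step_{step}.pkl"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename only after the file is complete, so readers never see a partial file.
        tmp.replace(final)

    def wait(self):
        # Block until the background write (if any) has finished.
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def latest(self):
        # Return (step, state) for the newest finished checkpoint, or None.
        files = sorted(
            self.directory.glob("step_*.pkl"),
            key=lambda p: int(p.stem.split("_")[1]),
        )
        if not files:
            return None
        with open(files[-1], "rb") as f:
            return int(files[-1].stem.split("_")[1]), pickle.load(f)
```

After a restart, a training script following this pattern would call latest() and resume from the returned step instead of starting over. A real implementation would snapshot framework tensors rather than plain Python objects.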

Quick Start

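Launch a fault-tolerant distributed training job by specifying the initial number of nodes, the minimum and maximum scale, and a checkpoint directory. For example, "dlrover --num_nodes=4 --max_node=8 --min_node=2 python train.py" starts with 4 nodes and allows scaling between 2 and 8 nodes; note that this example command does not set a checkpoint directory.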
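
To make the minimum and maximum bounds in the Quick Start command concrete, here is a small sketch of how an elastic launcher might choose a node count. The function plan_node_count is hypothetical, and the rule it encodes (wait below the minimum, cap at the maximum) is an assumption about typical elastic training behavior, not a description of DLRover's actual scheduler.

```python
from typing import Optional


def plan_node_count(available: int, min_nodes: int, max_nodes: int) -> Optional[int]:
    """Pick how many nodes to train on, or None if the job should wait.

    Assumed rule: never run below min_nodes, never use more than max_nodes.
    """
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("expected 0 < min_nodes <= max_nodes")
    if available < min_nodes:
        return None  # too few healthy nodes: wait for more to join
    return min(available, max_nodes)


if __name__ == "__main__":
    # Bounds taken from the Quick Start example: min 2, max 8.
    for available in (1, 2, 4, 8, 12):
        print(available, "->", plan_node_count(available, min_nodes=2, max_nodes=8))
```

Under this assumed rule, a job that loses nodes keeps training as long as at least the minimum number remain, and extra nodes beyond the maximum are left unused.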

Dependency Matrix

Required Modules

dlrover

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: dlrover
Download link: https://github.com/dongg622/china-ai-chip-skill/archive/main.zip#dlrover

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
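
For a manual install, the extraction step can be sketched in Python. This is only an illustration under stated assumptions: the helper extract_skill is hypothetical, and it assumes the downloaded archive contains a folder named after the skill (here, dlrover) somewhere inside it. The actual layout of the archive is not described on this page, so check the extracted result.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_skill(zip_path, skill_name, skills_dir):
    """Copy the first folder named `skill_name` in the archive to skills_dir/skill_name.

    Assumption: the archive holds the skill in a folder with that exact name.
    """
    target = Path(skills_dir) / skill_name
    extracted = 0
    prefix = None
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            parts = PurePosixPath(info.filename).parts
            # Remember the path of the first folder whose name matches the skill.
            if prefix is None and skill_name in parts:
                prefix = parts[: parts.index(skill_name) + 1]
            if prefix is None or parts[: len(prefix)] != prefix or info.is_dir():
                continue
            relative = parts[len(prefix):]
            if not relative or ".." in relative:
                continue  # skip the folder entry itself and unsafe paths
            out = target.joinpath(*relative)
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_bytes(archive.read(info))
            extracted += 1
    if extracted == 0:
        raise FileNotFoundError(f"no '{skill_name}' folder found in {zip_path}")
    return target
```

With the archive already downloaded as main.zip, a call such as extract_skill("main.zip", "dlrover", ".claude/skills") would place the files under .claude/skills/dlrover.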