resiliency
Official
Boost ML training resilience with fault tolerance
Software Engineering · #checkpointing #fault-tolerance #resiliency #megatron-bridge #nvidia-resiliency-ext #straggler-detection #in-process-restart
Author: NVIDIA
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Megatron Bridge resiliency features provide fault tolerance, straggler detection, in-process restart, preemption handling, and a re-run state machine to improve the stability of large-scale training workflows. Use them when you need hang detection, automatic restart, or checkpoint recovery in distributed training.
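As a rough illustration of what checkpoint recovery after preemption involves, here is a minimal Python sketch using only the standard library. It is not the Megatron Bridge or nvidia-resiliency-ext API: `PreemptionFlag`, `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names, the checkpoint is a small JSON file, and the "training step" is a stand-in.

```python
import json
import os
import signal
import tempfile


class PreemptionFlag:
    """Hypothetical helper: records that a termination signal arrived."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only set a flag here; the training loop does the actual saving.
        self.requested = True

    def install(self, signum=signal.SIGTERM):
        # Must be called from the main thread of the process.
        signal.signal(signum, self.handler)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, flag, save_every=10, after_step=None):
    """Resume from `path`, checkpoint periodically, and save-and-stop on preemption."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if after_step is not None:
            after_step(state["step"])  # hook used here only for demonstration
        if flag.requested or state["step"] % save_every == 0:
            save_checkpoint(path, state)
        if flag.requested:
            break
    return state
```

In a real job you would call `flag.install()` once before `train(...)`; relaunching with the same `path` then continues from the last saved step rather than from zero. The sketch saves synchronously, so it does not show the async or local checkpointing the skill describes.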
Core Features & Use Cases
- Fault tolerance using nvidia-resiliency-ext to automatically recover from failures during long runs.
- Straggler detection and graceful handling to maximize resource utilization in multi-node training.
- Experimental in-process restart and re-run state machine for rapid recovery without full job relaunch.
- Preemption support and non-blocking checkpointing to minimize downtime during cluster pressure.
- Async checkpointing and local checkpointing to accelerate recovery and reduce loss of progress.
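To make the straggler-detection idea concrete, the sketch below flags ranks whose step time is well above the median across ranks. This is a simplified assumption for illustration: `find_stragglers` and its `threshold` parameter are hypothetical, and the scoring that nvidia-resiliency-ext actually uses may differ.

```python
from statistics import median


def find_stragglers(step_times, threshold=1.5):
    """Return the ranks whose step time exceeds `threshold` times the median.

    `step_times` maps rank -> seconds taken for the most recent step.
    """
    if not step_times:
        return []
    baseline = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > threshold * baseline)
```

For example, `find_stragglers({0: 1.0, 1: 1.1, 2: 0.9, 3: 2.4})` returns `[3]`, since the median is 1.05 s and only rank 3 exceeds 1.5 times that. Using the median rather than the mean keeps one very slow rank from inflating the baseline it is compared against.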
Quick Start
Enable resiliency features by turning on the fault tolerance (FT), NVRx straggler detection, in-process restart (IPR), and checkpointing options in Megatron Bridge.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill: Name: resiliency Download link: https://github.com/NVIDIA/skills/archive/main.zip#resiliency Please download this .zip file, extract it, and install it in the .claude/skills/ directory.