resiliency

Official

Boost ML training resilience with fault tolerance

AuthorNVIDIA
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Megatron Bridge resiliency features provide fault tolerance, straggler detection, in-process restart, preemption, and a re-run state machine to improve stability of large-scale training workflows. Use when you need hang detection, automatic restart, or checkpoint recovery in distributed training.

Core Features & Use Cases

  • Fault tolerance using nvidia-resiliency-ext to automatically recover from failures during long runs.
  • Straggler detection and graceful handling to maximize resource utilization in multi-node training.
  • Experimental in-process restart and re-run state machine for rapid recovery without full job relaunch.
  • Preemption support and non-blocking checkpointing to minimize downtime during cluster pressure.
  • Async checkpointing and local checkpointing to accelerate recovery and reduce loss of progress.

Quick Start

Enable resiliency features by turning on the FT, NVRx straggler, IPR, and checkpointing options in Megatron Bridge.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: resiliency
Download link: https://github.com/NVIDIA/skills/archive/main.zip#resiliency

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.