nemo-mbridge-perf-cuda-graphs
Community
Reduce host overhead with CUDA graphs.
Software Engineering
#gpu-training #cuda-graphs #megatron-bridge #te-scoped-graphs #local-full-iteration #performance-harness #rng-tracker
Author: sayalinvidia
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
CUDA graph capture reduces host-side driver overhead in Megatron Bridge training. A GPU workload is captured once and then replayed, so the host no longer has to launch each kernel individually. The goal is higher and more stable throughput.
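To make the host-overhead argument concrete, here is a toy cost model. It is only a sketch: the function name and every number in it (kernel counts, microseconds per launch) are made up for illustration and are not measurements from Megatron Bridge or any GPU.

```python
def host_launch_time_us(eager_kernels, per_kernel_us, graphs=0, per_graph_us=0):
    """Toy model of host-side launch time for one training iteration.

    eager_kernels: kernels the host still launches one by one
    graphs:        captured graphs replayed with a single launch each
    All inputs are illustrative integers in microseconds, not measurements.
    """
    return eager_kernels * per_kernel_us + graphs * per_graph_us


# Hypothetical iteration with 2000 kernels at 5 us of host work per launch.
eager = host_launch_time_us(2000, 5)

# Whole iteration captured as one graph (the local full-iteration case).
full_iteration = host_launch_time_us(0, 5, graphs=1, per_graph_us=50)

# Per-layer scoped graphs: 32 graphs, with 200 kernels left outside any graph.
scoped = host_launch_time_us(200, 5, graphs=32, per_graph_us=50)

print(eager, full_iteration, scoped)
```

The model ignores capture cost and GPU execution time; it only shows why replacing many launches with a few replays shrinks the host's share of an iteration.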
Core Features & Use Cases
- Local full-iteration graphs using MCore wrappers to capture forward and backward passes.
- Transformer Engine scoped graphs per layer for attn, mlp, moe, moe_router, moe_preprocess, and mamba.
- Clear enablement prerequisites, including static tensor shapes and RNG trackers, to ensure reliable replay.
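The prerequisites above can be sketched as a small pre-flight check. This is a hypothetical helper, not the Megatron Bridge configuration schema: `GraphSettings`, `check_prerequisites`, and the `impl` labels are names invented for this example. Only the scope names come from the feature list.

```python
from dataclasses import dataclass, field

# Scope names as given in the feature list.
TE_SCOPES = {"attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"}


@dataclass
class GraphSettings:
    """Hypothetical settings record used only for this illustration."""
    impl: str = "none"  # "none", "local", or "te_scoped" (illustrative labels)
    scopes: list = field(default_factory=list)
    static_shapes: bool = False
    rng_tracker: bool = False


def check_prerequisites(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if settings.impl == "none":
        return problems  # graphs disabled, nothing to check
    if settings.impl not in ("local", "te_scoped"):
        problems.append(f"unknown impl: {settings.impl}")
    if not settings.static_shapes:
        problems.append("static tensor shapes are required for replay")
    if not settings.rng_tracker:
        problems.append("the RNG tracker must be enabled")
    if settings.impl == "te_scoped":
        unknown = sorted(set(settings.scopes) - TE_SCOPES)
        if unknown:
            problems.append("unknown scopes: " + ", ".join(unknown))
    return problems


ok = GraphSettings(impl="te_scoped", scopes=["attn", "mlp"],
                   static_shapes=True, rng_tracker=True)
bad = GraphSettings(impl="te_scoped", scopes=["attn", "conv"])

print(check_prerequisites(ok))
print(check_prerequisites(bad))
```

Collecting every problem in a list, rather than raising on the first one, lets a launcher report all missing prerequisites in a single pass.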
Quick Start
Launch either the TE-scoped or the local full-iteration CUDA graphs path with static tensor shapes and the RNG tracker enabled, then run the performance harness to validate replay.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: nemo-mbridge-perf-cuda-graphs Download link: https://github.com/sayalinvidia/sayali-skills-test/archive/main.zip#nemo-mbridge-perf-cuda-graphs Please download this .zip file, extract it, and install it in the .claude/skills/ directory.