perf-cuda-graphs
Official
Optimize GPU workloads with CUDA graph capture techniques.
Author: NVIDIA-NeMo
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill enables CUDA graph capture for GPU training. Replaying a captured graph submits a whole sequence of kernels in a single launch instead of one launch per kernel, which reduces host-driver overhead and shortens training iterations.
Core Features & Use Cases
- CUDA Graph Capture: Supports local full-iteration graphs and Transformer Engine (TE) scoped graphs for attention, MLP, and MoE modules.
- Performance Optimization: Applies to training large models with static shapes, where it improves throughput and reduces iteration time.
- Use Case: Accelerate training of large-scale language models on GPU clusters by minimizing kernel launch overhead.
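The static-shape requirement above matters because a captured graph replays the same kernels with the same tensor shapes on every launch. A minimal sketch of a pre-flight check, assuming batches are described by plain shape tuples; `shapes_are_static` is a hypothetical helper for illustration and is not part of this Skill:

```python
from typing import Iterable, Tuple

Shape = Tuple[int, ...]


def shapes_are_static(batch_shapes: Iterable[Shape]) -> bool:
    """Return True if every batch shape equals the first one (or there are none)."""
    it = iter(batch_shapes)
    first = next(it, None)
    # all() over an exhausted iterator is True, so empty and single-batch
    # inputs count as static.
    return all(shape == first for shape in it)


# Illustrative shapes only: (micro-batch size, sequence length).
fixed = [(8, 4096), (8, 4096), (8, 4096)]
variable = [(8, 4096), (8, 3072)]

print(shapes_are_static(fixed))     # True
print(shapes_are_static(variable))  # False
```

If the check fails, padding or bucketing sequences to a fixed length is one common way to restore static shapes before enabling capture.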
Quick Start
Use the perf-cuda-graphs skill to enable CUDA graph capture during training: set the implementation to 'transformer_engine' and configure the scope to cover the attention and MoE modules.
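The settings described above might be modeled as follows. This is a sketch under stated assumptions: `CudaGraphConfig`, its `impl` and `scope` fields, the scope names `attn`, `mlp`, and `moe`, and the `none` value are hypothetical and chosen for illustration. Only the value 'transformer_engine' and the local/TE distinction come from this page, so check the Skill's references for the real option names.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical option values; not the confirmed Megatron-Bridge API.
ALLOWED_IMPLS = {"none", "local", "transformer_engine"}
TE_SCOPES = {"attn", "mlp", "moe"}


@dataclass
class CudaGraphConfig:
    """Illustrative container for CUDA graph capture settings."""

    impl: str = "none"
    scope: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the implementation or scope is not recognized."""
        if self.impl not in ALLOWED_IMPLS:
            raise ValueError(f"unknown implementation: {self.impl!r}")
        if self.impl == "transformer_engine":
            if not self.scope:
                raise ValueError("TE-scoped capture needs at least one module in scope")
            unknown = set(self.scope) - TE_SCOPES
            if unknown:
                raise ValueError(f"unknown scope entries: {sorted(unknown)}")


# Quick Start settings: TE implementation, attention and MoE modules in scope.
cfg = CudaGraphConfig(impl="transformer_engine", scope=["attn", "moe"])
cfg.validate()
print(cfg)
```

Validating the scope up front surfaces a mistyped module name at configuration time rather than partway through a training run.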
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: perf-cuda-graphs
Download link: https://github.com/NVIDIA-NeMo/Megatron-Bridge/archive/main.zip#perf-cuda-graphs
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.