perf-torch-cuda-graphs
Community
Cut PyTorch overhead with CUDA Graphs.
Author: yo-steven
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This skill helps you reduce PyTorch CPU→GPU launch overhead by capturing GPU workloads once and replaying them efficiently with CUDA Graphs.
Core Features & Use Cases
- API selection for CUDA Graphs in PyTorch: Choose among torch.compile(mode="reduce-overhead"), torch.cuda.make_graphed_callables(), Transformer Engine make_graphed_callables, Megatron-LM CudaGraphManager, Megatron-LM FullCudaGraphWrapper, or manual torch.cuda.graph().
- Compatibility and capture workflows: Guides how to structure capture/replay loops, choose warmup steps, and handle static/dynamic constraints.
- Performance troubleshooting: Diagnoses common speedup blockers (sync points, fragmentation, wrong replay order, RNG/FP8 issues) and provides targeted fixes.
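The manual torch.cuda.graph() workflow from the list above can be sketched as follows. This is a minimal illustration, not the skill's own code: the helper names (cuda_graphs_available, capture_forward, demo), the toy Linear model, and the warmup count of 3 are assumptions chosen for the example. The guarded import only exists so the sketch loads as a no-op on machines without PyTorch or a GPU.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def cuda_graphs_available():
    """True only when PyTorch is installed and a CUDA device is visible."""
    return torch is not None and torch.cuda.is_available()


def capture_forward(model, static_input, warmup_iters=3):
    """Warm up on a side stream, then capture one forward pass into a CUDA graph."""
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(warmup_iters):  # warmup runs happen outside the capture
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)  # recorded, reused on every replay
    return graph, static_output


def demo():
    """Replay the graph on fresh data and compare against an eager forward pass."""
    if not cuda_graphs_available():
        return None
    model = torch.nn.Linear(16, 4).cuda().eval()  # hypothetical toy model
    static_input = torch.zeros(8, 16, device="cuda")
    graph, static_output = capture_forward(model, static_input)

    batch = torch.randn(8, 16, device="cuda")
    static_input.copy_(batch)  # refill the captured input buffer in place
    graph.replay()             # relaunch the recorded kernels
    with torch.no_grad():
        expected = model(batch)
    return bool(torch.allclose(static_output, expected, atol=1e-5))
```

The static constraint shows up in demo(): new data is copied into the tensor that was captured rather than passed as a new tensor, and the result is read from the captured output tensor after replay.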
Quick Start
Use the perf-torch-cuda-graphs skill to apply CUDA Graph capture to your PyTorch training loop by selecting the best workflow for your stack (native PyTorch, TE FP8/PP, or Megatron-LM).
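For the native PyTorch path, a graphed training loop might look roughly like the sketch below, which uses torch.cuda.make_graphed_callables(). The function name train_with_graphed_model, the small Sequential model, the SGD optimizer, and the random data are all hypothetical stand-ins for a real training setup. The function returns None when PyTorch or CUDA is unavailable.

```python
try:
    import torch
except ImportError:  # keeps the sketch loadable without PyTorch
    torch = None


def train_with_graphed_model(steps=5):
    """Run a few graphed training steps; return the last loss, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None

    # Hypothetical toy model and optimizer, chosen only for illustration.
    model = torch.nn.Sequential(
        torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    ).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # The sample input must match real inputs in shape and requires_grad state.
    sample = torch.randn(16, 32, device="cuda")
    graphed_model = torch.cuda.make_graphed_callables(model, (sample,))

    last_loss = None
    for _ in range(steps):
        data = torch.randn(16, 32, device="cuda")
        target = torch.randn(16, 1, device="cuda")
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(graphed_model(data), target)  # forward replays a graph
        loss.backward()                              # backward replays a graph
        optimizer.step()                             # optimizer runs eagerly
        last_loss = loss.item()
    return last_loss
```

Only the model's forward and backward passes are graphed here; the loss computation and optimizer step still run eagerly, which keeps the change to an existing loop small.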
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: perf-torch-cuda-graphs Download link: https://github.com/yo-steven/skills-exploration-20260522/archive/main.zip#perf-torch-cuda-graphs Please download this .zip file, extract it, and install it in the .claude/skills/ directory.