perf-torch-cuda-graphs


Cut PyTorch overhead with CUDA Graphs.

Author: yo-steven
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This skill helps you reduce the CPU-side overhead of launching GPU kernels from PyTorch by capturing a GPU workload once and replaying it with CUDA Graphs.

Core Features & Use Cases

  • API selection for CUDA Graphs in PyTorch: Choose among torch.compile(mode="reduce-overhead"), torch.cuda.make_graphed_callables(), Transformer Engine make_graphed_callables, Megatron-LM CudaGraphManager, Megatron-LM FullCudaGraphWrapper, or manual torch.cuda.graph().
  • Compatibility and capture workflows: Explains how to structure capture/replay loops, choose the number of warmup steps, and handle static versus dynamic constraints.
  • Performance troubleshooting: Diagnoses common speedup blockers (sync points, fragmentation, wrong replay order, RNG/FP8 issues) and provides targeted fixes.
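
As an illustration of the manual torch.cuda.graph() route listed above, here is a minimal sketch of a capture/replay loop for inference. The helper name make_replay_fn, the toy model, and the default of three warmup iterations are assumptions made for this example and are not taken from the skill itself. The sketch falls back to eager execution when no CUDA device is present.

```python
import torch


def make_replay_fn(model, example_input, warmup_iters=3):
    """Capture one forward pass of `model` and return a function that replays it.

    Hypothetical helper for illustration; falls back to eager mode without CUDA.
    """
    if not torch.cuda.is_available():
        def eager(x):
            with torch.no_grad():
                return model(x)
        return eager

    model = model.cuda()
    # Static buffer: the captured graph always reads from this memory address.
    static_input = example_input.clone().cuda()

    # Warm up on a side stream so lazy initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    # Capture a single forward pass into a CUDA graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph), torch.no_grad():
        static_output = model(static_input)

    def replay(x):
        # Copy new data into the static buffer, then replay the captured kernels.
        static_input.copy_(x)
        graph.replay()
        # Clone so the result survives the next replay, which overwrites static_output.
        return static_output.clone()

    return replay


model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
)
run = make_replay_fn(model, torch.randn(8, 16))
out = run(torch.randn(8, 16))
```

Every call to run must use the same input shape as the captured example, because the graph replays fixed kernels on fixed buffers.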

Quick Start

Use the perf-torch-cuda-graphs skill to apply CUDA Graph capture to your PyTorch training loop, selecting the workflow that fits your stack: native PyTorch, Transformer Engine (TE) with FP8 or pipeline parallelism (PP), or Megatron-LM.
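
For the native PyTorch path, a training loop might be graphed along the following lines with torch.cuda.make_graphed_callables(). This is a sketch under stated assumptions: the function name train_steps, the toy model, the batch shapes, and the hyperparameters are hypothetical and chosen only to keep the example small. Without a CUDA device the loop runs the eager model unchanged.

```python
import torch


def train_steps(num_steps=5, warmup_iters=3):
    """Run a few training steps, graphing forward/backward when CUDA is available."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.manual_seed(0)

    # Toy model and data (hypothetical, for illustration only).
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = torch.nn.MSELoss()
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 1, device=device)

    forward = model
    if device == "cuda":
        # The sample argument must match real inputs in shape, dtype, and requires_grad.
        sample = torch.randn(8, 16, device=device)
        forward = torch.cuda.make_graphed_callables(
            model, (sample,), num_warmup_iters=warmup_iters
        )

    losses = []
    for _ in range(num_steps):
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(forward(inputs), targets)  # graphed forward on CUDA
        loss.backward()                           # graphed backward on CUDA
        optimizer.step()                          # optimizer step stays eager
        losses.append(loss.item())                # .item() syncs; fine outside the graph
    return losses


losses = train_steps()
```

Only the model's forward and backward passes are graphed here; the loss and optimizer step run eagerly, which keeps the example simple at the cost of some remaining launch overhead.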

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install the skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: perf-torch-cuda-graphs
Download link: https://github.com/yo-steven/skills-exploration-20260522/archive/main.zip#perf-torch-cuda-graphs

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
