cuda-roofline-strategy
Community
Pick the next CUDA optimization with confidence.
Software Engineering
#performance tuning, #cuda, #occupancy, #kernel optimization, #roofline, #nvidia ncu, #warp stalls
Author: Romaosir
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill keeps you from wasting GPU-kernel optimization experiments. Instead of picking the next technique class by guesswork, it first diagnoses what is currently limiting performance and then recommends a matching class of optimization.
Core Features & Use Cases
- Classifies roofline bottlenecks (compute-bound, bandwidth-bound, occupancy-limited, latency-bound, balanced) from NCU metrics like SM throughput, DRAM/SOL throughput, achieved occupancy, and top warp stall reason.
- Selects a technique tier based on iteration phase (early, mid, late, plateau), so recommendations escalate with how stalled the run is and heavy rewrites are suggested only when they are warranted.
- Supports per-workload refinement: because averages can mislead, it re-profiles slow outlier workloads and adapts dispatch when workloads fall into different regimes.
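The bottleneck classification in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the Skill's actual logic: the `NcuSummary` type, the `classify` function, the decision order, and every threshold are assumptions chosen for readability, and the stall reason is treated as a free-form string copied from the profile.

```python
from dataclasses import dataclass


@dataclass
class NcuSummary:
    """Hypothetical container for a few values read off an NCU report."""
    sm_throughput_pct: float       # SM throughput, % of peak
    dram_throughput_pct: float     # DRAM throughput, % of peak
    achieved_occupancy_pct: float  # achieved occupancy, %
    top_stall_reason: str          # free-form label of the top warp stall


# Illustrative thresholds only; the Skill's real cutoffs are not documented here.
HIGH_THROUGHPUT = 60.0
LOW_OCCUPANCY = 40.0
BALANCE_MARGIN = 10.0


def classify(m: NcuSummary) -> str:
    """Return one of the five bottleneck labels named in the feature list."""
    sm_high = m.sm_throughput_pct >= HIGH_THROUGHPUT
    dram_high = m.dram_throughput_pct >= HIGH_THROUGHPUT
    gap = abs(m.sm_throughput_pct - m.dram_throughput_pct)

    if sm_high and dram_high and gap <= BALANCE_MARGIN:
        return "balanced"
    if sm_high and m.sm_throughput_pct > m.dram_throughput_pct:
        return "compute-bound"
    if dram_high:
        return "bandwidth-bound"
    # Neither unit is busy: the kernel is starved by too few resident warps
    # or by latency that the resident warps cannot hide.
    if m.achieved_occupancy_pct < LOW_OCCUPANCY:
        return "occupancy-limited"
    return "latency-bound"


def summarize(m: NcuSummary) -> str:
    """Attach the top stall reason so the label can be sanity-checked."""
    return f"{classify(m)} (top stall: {m.top_stall_reason})"
```

For example, `summarize(NcuSummary(20.0, 25.0, 75.0, "long_scoreboard"))` yields a latency-bound label together with the stall reason, which hints at where to look next.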
Quick Start
Use this Skill when you have a fresh NCU profile and need to decide which category of CUDA kernel optimization to try next based on the current roofline position and your iteration phase.
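As a rough illustration of how iteration phase might feed into that decision, the sketch below maps each phase to a numeric tier. The tier numbers, the `select_tier` and `heavy_rewrite_warranted` names, and the cutoff for heavy rewrites are all hypothetical; only the four phase names come from this page.

```python
# Hypothetical tier numbers: a higher tier means more invasive changes
# are on the table. Only the phase names come from the Skill description.
TIER_BY_PHASE = {"early": 1, "mid": 2, "late": 3, "plateau": 4}

# Assumed cutoff: heavy rewrites are only considered from this tier upward.
HEAVY_REWRITE_MIN_TIER = 3


def select_tier(phase: str) -> int:
    """Map an iteration phase to a technique tier, rejecting unknown phases."""
    key = phase.strip().lower()
    if key not in TIER_BY_PHASE:
        raise ValueError(f"unknown iteration phase: {phase!r}")
    return TIER_BY_PHASE[key]


def heavy_rewrite_warranted(phase: str) -> bool:
    """True when the phase is late enough to justify a heavy rewrite."""
    return select_tier(phase) >= HEAVY_REWRITE_MIN_TIER
```

Under these assumptions, an early or mid run stays with lighter techniques, while a late or plateaued run may consider heavier rewrites.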
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: cuda-roofline-strategy Download link: https://github.com/Romaosir/IF_Romao_kernel_optimize/archive/main.zip#cuda-roofline-strategy Please download this .zip file, extract it, and install it in the .claude/skills/ directory.