cuda-kernel-techniques
Community
Find the right CUDA optimization technique fast.
Software Engineering
#bottleneck analysis #memory access #regression debugging #cuda optimization #kernel performance #ncu profiling #occupancy tuning
Author: Romaosir
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill helps you choose the right CUDA kernel optimization technique for a given bottleneck and hardware context, which cuts down on wasted experiments and repeated regressions.
Core Features & Use Cases
- Bottleneck-Driven Technique Lookup: Quickly route to the right sub-topic (memory access, data placement, parallelism, compute, control flow, occupancy, numerical stability, and anti-patterns) based on profiler symptoms (e.g., DRAM stalls, scoreboard stalls, occupancy limits).
- Hardware-Context Guidance (B200/SM100 + older SMs): Provides hardware notes and thresholds where behavior differs, so you can adapt techniques correctly for Blackwell B200 (SM100) and compare against SM80/SM90 expectations.
- Regression-Safe Debugging: Consult the catalog’s anti-pattern and “when it hurts” sections to rule out known dead ends when a change regresses performance.
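The bottleneck-driven lookup described above can be pictured as a small routing function from profiler symptoms to catalog sub-topics. The sketch below is illustrative only and is not the Skill's actual logic: the `ProfilerSummary` fields and the `route` function are hypothetical names, and every threshold except the 70% DRAM figure (borrowed from the Quick Start prompt) is a placeholder assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProfilerSummary:
    """Hypothetical digest of a few Nsight Compute metrics, in percent."""
    dram_throughput_pct: float        # DRAM throughput relative to peak
    long_scoreboard_stall_pct: float  # share of stalls waiting on memory loads
    achieved_occupancy_pct: float     # achieved occupancy relative to the maximum


def route(summary: ProfilerSummary) -> List[str]:
    """Return catalog sub-topics worth reading first (assumed rules)."""
    topics: List[str] = []
    # 70% comes from the Quick Start prompt; the other cutoffs are placeholders.
    if summary.dram_throughput_pct > 70.0:
        topics.append("memory access")
        if summary.long_scoreboard_stall_pct > 30.0:
            topics.append("data placement")
    if summary.achieved_occupancy_pct < 50.0:
        topics.append("occupancy")
    if not topics:
        # Nothing memory- or occupancy-bound was flagged, so start with compute.
        topics.append("compute")
    return topics


if __name__ == "__main__":
    print(route(ProfilerSummary(85.0, 40.0, 60.0)))
```

In practice the Skill performs this routing from a natural-language prompt; the function only makes the "symptom in, sub-topic out" idea concrete.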
Quick Start
Use the cuda-kernel-techniques skill to identify the best optimization technique for my kernel when NCU shows DRAM > 70% with long scoreboard stalls, and explain when cp.async and TMA help or hurt on SM100.
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: cuda-kernel-techniques
Download link: https://github.com/Romaosir/IF_Romao_kernel_optimize/archive/main.zip#cuda-kernel-techniques
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
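If you would rather install by hand, the extract-and-copy step can be sketched as below. This is a minimal sketch under assumptions: it presumes you have already downloaded the archive from the link above, and that the archive contains a folder named after the skill somewhere in its tree (the real layout of the repository is not confirmed here). The helper name `extract_skill` is hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath
from typing import List


def extract_skill(zip_path, skill_name: str, skills_dir) -> List[Path]:
    """Copy files found under a folder named `skill_name` inside the archive
    into `skills_dir/skill_name`, and return the paths that were written."""
    target_root = Path(skills_dir) / skill_name
    written: List[Path] = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            parts = PurePosixPath(info.filename).parts
            # Skip unsafe entries and anything outside the skill folder.
            if ".." in parts or skill_name not in parts[:-1]:
                continue
            # Keep only the path below the skill folder.
            relative = Path(*parts[parts.index(skill_name) + 1:])
            destination = target_root / relative
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(archive.read(info))
            written.append(destination)
    return written


if __name__ == "__main__":
    # Assumes main.zip was downloaded into the current directory beforehand.
    files = extract_skill("main.zip", "cuda-kernel-techniques", Path(".claude") / "skills")
    print(f"Installed {len(files)} files")
```

If the function reports zero files, the assumed folder layout does not match the archive, and the automatic route through Claude Code is the safer option.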