cuda-kernel-techniques

Community

Find the right CUDA optimization technique fast.

Author: Romaosir
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps you choose a CUDA kernel optimization technique that matches the bottleneck your profiler reports and the hardware you are targeting, so you spend fewer experiments on changes that do not help and fewer cycles chasing regressions.

Core Features & Use Cases

  • Bottleneck-Driven Technique Lookup: Route from profiler symptoms (e.g., DRAM stalls, scoreboard stalls, occupancy limits) to the matching sub-topic: memory access, data placement, parallelism, compute, control flow, occupancy, numerical stability, or anti-patterns.
  • Hardware-Context Guidance (B200/SM100 + older SMs): Consult hardware notes and thresholds where behavior differs, so you can adapt techniques correctly for Blackwell B200 (SM100) and compare against SM80/SM90 expectations.
  • Regression-Safe Debugging: Use the catalog’s dedicated anti-patterns and “when it hurts” sections to avoid known dead ends when a change regresses.

Quick Start

Use the cuda-kernel-techniques skill to identify the best optimization technique for my kernel when NCU shows DRAM > 70% with long scoreboard stalls, and explain when cp.async and TMA help or hurt on SM100.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: cuda-kernel-techniques
Download link: https://github.com/Romaosir/IF_Romao_kernel_optimize/archive/main.zip#cuda-kernel-techniques

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
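
If you would rather install by hand, the steps in the prompt above (download, extract, copy into .claude/skills/) can be sketched in Python. This is a minimal sketch, not the official installer: the function name install_skill is hypothetical, and it assumes the skill lives in a folder named cuda-kernel-techniques somewhere inside the archive, which the URL fragment suggests but this page does not confirm.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

SKILL_NAME = "cuda-kernel-techniques"


def install_skill(zip_path, skills_dir, skill_name=SKILL_NAME):
    """Copy the folder named `skill_name` from a zip archive into `skills_dir`.

    Hypothetical helper. Assumption: the archive contains a directory whose
    name equals `skill_name`; its exact location in the repo is not known.
    """
    skills_dir = Path(skills_dir)
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        # Look for the skill folder at any depth and prefer the shallowest match.
        matches = [p for p in Path(tmp).rglob(skill_name) if p.is_dir()]
        if not matches:
            raise FileNotFoundError(f"no '{skill_name}' folder in {zip_path}")
        source = min(matches, key=lambda p: len(p.parts))
        target = skills_dir / skill_name
        skills_dir.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok lets a re-install overwrite an earlier copy (Python 3.8+).
        shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Example usage (performs a network download, so it is left commented out):
# import urllib.request
# url = "https://github.com/Romaosir/IF_Romao_kernel_optimize/archive/main.zip"
# urllib.request.urlretrieve(url, "main.zip")
# install_skill("main.zip", Path(".claude") / "skills")
```

The copy goes through a temporary directory so that only the skill folder, not the rest of the repository, ends up under .claude/skills/.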
