b200-moe-kernel-optimization
Community
Turn PyTorch MoE into fast, correct B200 kernels.
System Documentation
What problem does it solve?
This Skill helps you optimize Mixture-of-Experts (MoE) CUDA kernels for NVIDIA B200 (Blackwell, SM100), targeting major speedups without sacrificing correctness. It focuses on FP8/FP16 grouped GEMM pipelines that involve routing, gather/scatter, SwiGLU, and Blackwell-specific primitives.
Core Features & Use Cases
- Correctness-first optimization for MoE FP8: establishes a cuBLAS FP16 oracle baseline, uses strict failure-mode debugging for FP8 correctness, and avoids known-bad approaches documented in the dead-ends catalog.
- B200-specific optimization ladder with measurable ROI: guides you through an ordered sequence (cuBLAS FP16 → CUTLASS FP8 grouped GEMM → zero-sync fast path → dual-tile dispatch → static compile → tcgen05), including plateau rules and regression discipline.
- Backend selection + dispatch guardrails: covers CUTLASS vs tcgen05 decision points, T-dependent GEMM2 backend switching, and validation against silent fallbacks using NCU kernel-name checks.
- Production-style team workflow: recommends role separation (Planner/Implementer/Profiler) and GPU isolation rules to keep measurements trustworthy across multi-round tuning.
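The backend-selection and silent-fallback guardrail described in the list above can be sketched roughly as follows. This is a minimal illustration, not the Skill's actual code: it assumes T is the token count, and the threshold `T_SWITCH`, the backend labels, the direction of the switch, and the kernel-name substrings are all hypothetical placeholders. The kernel names are assumed to have been exported from an NCU report by a separate step.

```python
from typing import Iterable

# Hypothetical token-count threshold for switching the GEMM2 backend.
T_SWITCH = 1024

# Hypothetical substrings expected in profiled kernel names for each backend.
EXPECTED_KERNEL_SUBSTRING = {
    "cutlass_fp8": "cutlass",
    "tcgen05": "tcgen05",
}


def select_gemm2_backend(num_tokens: int, t_switch: int = T_SWITCH) -> str:
    """Pick a GEMM2 backend from the token count T (illustrative rule only)."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return "cutlass_fp8" if num_tokens < t_switch else "tcgen05"


def check_no_silent_fallback(backend: str, kernel_names: Iterable[str]) -> None:
    """Raise if no profiled kernel name matches the requested backend."""
    needle = EXPECTED_KERNEL_SUBSTRING[backend]
    names = list(kernel_names)
    if not any(needle in name.lower() for name in names):
        raise RuntimeError(
            f"expected a '{needle}' kernel for backend '{backend}', got: {names}"
        )
```

Raising instead of logging is a deliberate choice in this sketch: a benchmark that quietly ran a fallback kernel would otherwise report a timing for the wrong backend.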
Quick Start
Use the b200-moe-kernel-optimization skill to migrate your PyTorch Mixture-of-Experts reference to a B200-optimized CUDA implementation by following the optimization ladder and using cuBLAS FP16 as the correctness oracle on each change.
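As a rough sketch of what "oracle on each change" could look like in practice, the loop below accepts a ladder step only if its output matches the oracle within a tolerance and its runtime improves on the best accepted step so far. Everything here is an assumption made for illustration: the `climb_ladder` and `max_rel_error` helpers, the `(output, elapsed_ms)` return shape, and the `rel_tol` value are hypothetical, and real outputs would be GPU tensors rather than Python lists.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Each rung is (name, run), where run() returns (flattened output, elapsed ms).
Rung = Tuple[str, Callable[[], Tuple[Sequence[float], float]]]


def max_rel_error(candidate: Sequence[float], oracle: Sequence[float],
                  eps: float = 1e-6) -> float:
    """Largest element-wise relative error of candidate against the oracle."""
    if len(candidate) != len(oracle):
        raise ValueError("candidate and oracle must have the same length")
    return max(
        (abs(c - o) / (abs(o) + eps) for c, o in zip(candidate, oracle)),
        default=0.0,
    )


def climb_ladder(rungs: List[Rung], oracle_output: Sequence[float],
                 baseline_ms: float, rel_tol: float = 5e-2):
    """Accept a rung only if it is finite, matches the oracle, and is faster."""
    accepted: List[str] = []
    best_ms = baseline_ms
    for name, run in rungs:
        output, elapsed_ms = run()
        if not all(math.isfinite(x) for x in output):
            continue  # NaN/Inf output: reject before comparing
        if max_rel_error(output, oracle_output) > rel_tol:
            continue  # incorrect relative to the FP16 oracle
        if elapsed_ms >= best_ms:
            continue  # no speedup over the best accepted rung: treat as regression
        accepted.append(name)
        best_ms = elapsed_ms
    return accepted, best_ms
```

The explicit finiteness check matters because a NaN error compares as "not greater than" any tolerance and would otherwise slip through the correctness gate.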
Dependency Matrix
Required Modules
None required
Components
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: b200-moe-kernel-optimization
Download link: https://github.com/Romaosir/IF_Romao_kernel_optimize/archive/main.zip#b200-moe-kernel-optimization
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.