b200-moe-kernel-optimization

Community

Turn PyTorch MoE into fast, correct B200 kernels.

Author: Romaosir
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill guides the optimization of Mixture-of-Experts (MoE) CUDA kernels for NVIDIA B200 (Blackwell, SM100), aiming for major speedups without sacrificing correctness. It focuses on FP8/FP16 grouped GEMM pipelines that involve routing, gather/scatter, SwiGLU, and Blackwell-specific primitives.

Core Features & Use Cases

  • Correctness-first optimization for MoE FP8: establishes a cuBLAS FP16 oracle baseline, uses strict failure-mode debugging for FP8 correctness, and avoids known-bad approaches documented in the dead-ends catalog.
  • B200-specific optimization ladder with measurable ROI: guides you through an ordered sequence (cuBLAS FP16 → CUTLASS FP8 grouped GEMM → zero-sync fast path → dual-tile dispatch → static compile → tcgen05), including plateau rules and regression discipline.
  • Backend selection and dispatch guardrails: covers CUTLASS vs tcgen05 decision points, T-dependent GEMM2 backend switching, and the use of NCU kernel-name checks to catch silent fallbacks.
  • Production-style team workflow: recommends role separation (Planner/Implementer/Profiler) and GPU isolation rules to keep measurements trustworthy across multi-round tuning.
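
The kernel-name check mentioned under dispatch guardrails can be sketched as follows. This is a minimal illustration, not the Skill's own tooling: it assumes the launched kernel names have already been exported from a profiler report into a list of strings, and the function name `missing_kernel_markers` and the example kernel names are hypothetical.

```python
from typing import Iterable, List


def missing_kernel_markers(launched: Iterable[str],
                           expected_markers: Iterable[str]) -> List[str]:
    """Return every expected marker that matches no launched kernel name.

    An empty result means each intended backend actually ran. A non-empty
    result suggests a silent fallback to some other code path.
    """
    names = [name.lower() for name in launched]
    return [
        marker
        for marker in expected_markers
        if not any(marker.lower() in name for name in names)
    ]


# Hypothetical kernel names, for illustration only.
launched_kernels = ["cutlass_grouped_gemm_fp8_kernel", "moe_scatter_kernel"]
missing = missing_kernel_markers(launched_kernels, ["cutlass", "tcgen05"])
if missing:
    print("possible silent fallback, no kernel matched:", missing)
```

Substring matching is deliberately loose here because mangled kernel names vary between builds; a stricter check would pin the exact names observed in a known-good run.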

Quick Start

Use the b200-moe-kernel-optimization skill to migrate your PyTorch Mixture-of-Experts reference to a B200-optimized CUDA implementation: follow the optimization ladder one step at a time, and validate every change against the cuBLAS FP16 correctness oracle.
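
One plausible reading of this workflow is a loop that accepts a ladder step only if it matches the oracle and is measurably faster than the best step so far. The sketch below is an assumption-laden illustration in plain Python: the `run` callback, the `tolerance` and `min_gain` values, and the exact accept/reject rule are all hypothetical, and the Skill's actual plateau rules may differ.

```python
from typing import Callable, List, Sequence, Tuple

# Rung names as listed in the skill description.
LADDER = [
    "cuBLAS FP16",
    "CUTLASS FP8 grouped GEMM",
    "zero-sync fast path",
    "dual-tile dispatch",
    "static compile",
    "tcgen05",
]


def max_rel_error(candidate: Sequence[float], oracle: Sequence[float],
                  eps: float = 1e-6) -> float:
    """Largest elementwise relative error; eps guards near-zero oracle values."""
    return max(abs(c - o) / (abs(o) + eps) for c, o in zip(candidate, oracle))


def climb(oracle: Sequence[float],
          rungs: Sequence[str],
          run: Callable[[str], Tuple[Sequence[float], float]],
          tolerance: float = 0.05,   # hypothetical accuracy budget
          min_gain: float = 0.02     # hypothetical minimum speedup per rung
          ) -> List[Tuple[str, str]]:
    """Try each rung in order; run(rung) returns (output, latency_ms)."""
    best_ms = None
    log = []
    for rung in rungs:
        output, latency_ms = run(rung)
        if len(output) != len(oracle) or max_rel_error(output, oracle) > tolerance:
            # Correctness first: a fast but wrong kernel is never accepted.
            log.append((rung, "rejected: mismatch vs oracle"))
        elif best_ms is not None and latency_ms > best_ms * (1 - min_gain):
            # Plateau or regression: not measurably faster than the best so far.
            log.append((rung, "rejected: no measurable gain"))
        else:
            best_ms = latency_ms
            log.append((rung, "accepted"))
    return log
```

In practice `run` would launch the kernel for that rung on an isolated GPU and return its flattened output with a measured latency; here it is left as a callback so that the accept/reject logic stays testable without hardware.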

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: b200-moe-kernel-optimization
Download link: https://github.com/Romaosir/IF_Romao_kernel_optimize/archive/main.zip#b200-moe-kernel-optimization

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
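
For a manual install, the extract-and-copy step can be sketched as below. This is only an illustration under stated assumptions: it assumes the archive contains a folder named after the skill somewhere in its tree (the real layout of the repository is not confirmed here), and the helper name `extract_skill` is hypothetical.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath


def extract_skill(zip_bytes: bytes, skill_name: str, skills_dir: Path) -> Path:
    """Copy every file under a folder named `skill_name` into skills_dir/skill_name.

    Assumption: the archive holds the skill in a folder with that exact name.
    """
    target = skills_dir / skill_name
    copied = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            parts = PurePosixPath(info.filename).parts
            if info.is_dir() or skill_name not in parts:
                continue
            # Keep only the path below the skill folder.
            relative = parts[parts.index(skill_name) + 1:]
            if not relative or ".." in relative:
                continue
            destination = target.joinpath(*relative)
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(archive.read(info))
            copied += 1
    if copied == 0:
        raise FileNotFoundError(f"no files under '{skill_name}' in archive")
    return target
```

The archive bytes could be fetched with `urllib.request.urlopen(url).read()` using the download link above, then passed in as `extract_skill(data, "b200-moe-kernel-optimization", Path(".claude/skills"))`.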