Write an FP8 Kernel
Official
FP8 kernel design for Hopper/Ada GPUs.
Author: tensormux
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill guides architects and engineers through designing and implementing FP8 compute kernels for inference and training on NVIDIA Hopper (sm_90) and Ada Lovelace (sm_89) hardware. It covers FP8 format selection, scaling strategy, tensor core usage via WGMMA (a Hopper-only instruction) or cuBLAS, and dequantization epilogue design.
Core Features & Use Cases
- FP8 format mapping and per-tensor / per-channel scaling strategies for weights and activations, including E4M3 for forward paths and E5M2 for gradients.
- Choice of implementation path (cuBLAS FP8, CUTLASS FP8, or custom WGMMA) with guidance on when to prefer each.
- Epilogue design that dequantizes FP8 results back to fp16/bf16/fp32, with proper saturation (satfinite) and fp32 accumulation in the GEMM.
- Validation and benchmarking workflow to compare FP8 results against fp32 references and monitor overflow conditions.
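The scaling and saturation points in the list above can be illustrated with a minimal pure-Python sketch. It assumes the OCP FP8 maxima (448 for E4M3, 57344 for E5M2) and treats each matrix row as one channel. All helper names are hypothetical and are not part of cuBLAS or CUTLASS. Mantissa rounding is left out on purpose; only the scale choice, the satfinite clamp, and overflow counting are shown.

```python
# Largest finite magnitude of each FP8 format (assumed OCP definitions).
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


def per_tensor_scale(matrix, fmt="e4m3"):
    """One scale for the whole tensor: maps the largest |x| onto the FP8 max."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / FP8_MAX[fmt] if amax > 0 else 1.0


def per_channel_scales(matrix, fmt="e4m3"):
    """One scale per row; each row is treated as a channel in this sketch."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scales.append(amax / FP8_MAX[fmt] if amax > 0 else 1.0)
    return scales


def saturate(value, scale, fmt="e4m3"):
    """Scale a value and clamp it to the finite range (satfinite behaviour)."""
    limit = FP8_MAX[fmt]
    return max(-limit, min(limit, value / scale))


def count_overflows(matrix, scale, fmt="e4m3"):
    """Count elements that would exceed the FP8 range under a given scale."""
    limit = FP8_MAX[fmt]
    return sum(1 for row in matrix for v in row if abs(v / scale) > limit)


weights = [[0.5, -2.0], [100.0, 0.25]]
tensor_scale = per_tensor_scale(weights)      # driven by the 100.0 outlier
channel_scales = per_channel_scales(weights)  # the first row keeps a finer scale
```

In this example the per-tensor scale is set by the single outlier, while the per-channel scales let the first row use far more of the E4M3 range. A helper like `count_overflows` is mainly useful when the scale was computed on earlier data and may no longer cover the current tensor.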
Quick Start
Implement a small FP8 GEMM: select E4M3 for the inputs, set per-tensor scaling, use the cuBLAS FP8 path, and verify the results against an fp32 baseline.
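Before touching cuBLAS, the same workflow can be prototyped on the CPU. The following is a rough pure-Python emulation, not the cuBLAS API: it rounds inputs onto an assumed E4M3 grid (3 mantissa bits, smallest normal exponent -6, maximum 448, round-to-nearest-even, saturating), accumulates in Python floats as a stand-in for fp32 accumulators, and applies the two scales in a dequantization epilogue. All function names are hypothetical, and NaN inputs are not handled.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value (assumed OCP definition)
E4M3_MANT_BITS = 3   # explicit mantissa bits
E4M3_MIN_EXP = -6    # exponent of the smallest normal value


def round_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    if a >= E4M3_MAX:
        return sign * E4M3_MAX
    _, e = math.frexp(a)                  # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)        # subnormals share the minimum exponent
    step = 2.0 ** (exp - E4M3_MANT_BITS)  # spacing of representable values
    return sign * min(round(a / step) * step, E4M3_MAX)  # round() ties to even


def quantize(matrix):
    """Per-tensor scaling followed by E4M3 rounding; returns (values, scale)."""
    amax = max(abs(v) for row in matrix for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [[round_e4m3(v / scale) for v in row] for row in matrix], scale


def matmul(a, b):
    """Plain triple-loop matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def fp8_gemm(a, b):
    """Emulated FP8 GEMM: quantize, accumulate in wide floats, dequantize."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = matmul(qa, qb)  # Python floats stand in for fp32 accumulators
    return [[v * scale_a * scale_b for v in row] for row in acc]  # epilogue


def max_rel_error(result, reference):
    """Largest absolute error, normalized by the largest reference magnitude."""
    ref_max = max(abs(v) for row in reference for v in row)
    err = max(abs(r - t) for rr, tr in zip(result, reference)
              for r, t in zip(rr, tr))
    return err / ref_max


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [0.25, 2.0]]
error = max_rel_error(fp8_gemm(a, b), matmul(a, b))
```

Comparing `error` against a tolerance chosen for the model at hand gives a cheap first check. The same reference comparison can then be reused once the GEMM runs through a real FP8 path.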
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: Write an FP8 Kernel
Download link: https://github.com/tensormux/kernel-skills/archive/main.zip#write-an-fp8-kernel
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.