Write an FP8 Kernel

Official

FP8 kernel design for Hopper/Ada GPUs.

Author: tensormux
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill guides architects and engineers through designing and implementing FP8 compute kernels for inference and training on NVIDIA Hopper (sm_90) and Ada Lovelace (sm_89) hardware. It covers FP8 format selection, scaling strategy, tensor core usage via WGMMA (a Hopper-only instruction family) or cuBLAS, and dequantization epilogue design.

Core Features & Use Cases

  • FP8 format mapping and per-tensor / per-channel scaling strategies for weights and activations, including E4M3 for forward paths and E5M2 for gradients.
  • Choice of implementation path (cuBLAS FP8, CUTLASS FP8, or custom WGMMA) with guidance on when to prefer each.
  • Epilogue design that dequantizes FP8 results back to fp16/bf16/fp32, with proper saturation (satfinite) and fp32 accumulation in the GEMM.
  • Validation and benchmarking workflow to compare FP8 results against fp32 references and monitor overflow conditions.
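
The format and scaling choices above can be illustrated with a small host-side simulation. This is a minimal sketch in plain Python, not the Skill's own code: it assumes the OCP FP8 variants (E4M3 with bias 7 and largest finite value 448, E5M2 with bias 15 and largest finite value 57344), and the helper names `fp8_round`, `e4m3`, `e5m2`, `per_tensor_scale`, and `per_channel_scales` are hypothetical.

```python
import math

def fp8_round(x, man_bits, bias, max_finite):
    """Round x to the nearest value of a small float format, saturating on overflow."""
    if math.isnan(x):
        return math.nan
    a = abs(x)
    if a >= max_finite:
        # satfinite-style behaviour: overflow (including inf) clamps to the largest finite value
        return math.copysign(max_finite, x)
    if a == 0.0:
        return x
    min_exp = 1 - bias                      # exponent of the smallest normal; subnormals reuse it
    exp = max(math.frexp(a)[1] - 1, min_exp)
    step = 2.0 ** (exp - man_bits)          # spacing between neighbouring values at this exponent
    # Python's round() resolves ties to even, matching round-to-nearest-even
    return math.copysign(round(a / step) * step, x)

def e4m3(x):
    # Assumed OCP E4M3: 3 mantissa bits, bias 7, max finite 448
    return fp8_round(x, man_bits=3, bias=7, max_finite=448.0)

def e5m2(x):
    # Assumed OCP E5M2: 2 mantissa bits, bias 15, max finite 57344
    return fp8_round(x, man_bits=2, bias=15, max_finite=57344.0)

def per_tensor_scale(values, max_finite=448.0):
    """One scale for the whole tensor: the largest magnitude maps onto max_finite."""
    amax = max(abs(v) for v in values)
    return amax / max_finite if amax > 0 else 1.0

def per_channel_scales(rows, max_finite=448.0):
    """One scale per row (output channel), so small-magnitude rows keep their resolution."""
    return [per_tensor_scale(row, max_finite) for row in rows]

# Illustrative weights: two channels with very different magnitudes
weights = [[0.02, -0.5, 0.11], [3.0, -12.0, 7.5]]
flat = [v for row in weights for v in row]
print("per-tensor scale:", per_tensor_scale(flat))
print("per-channel scales:", per_channel_scales(weights))
print("0.35 as E4M3:", e4m3(0.35), "as E5M2:", e5m2(0.35))
```

E4M3 spends its bits on precision and E5M2 on range, which is one way to read the recommendation of E4M3 for forward paths and E5M2 for gradients. A real kernel would perform the conversion on the device with a saturating convert rather than in Python.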

Quick Start

Implement a small FP8 GEMM: select E4M3 for the inputs, set per-tensor scaling, use the cuBLAS FP8 path, and verify the results against an fp32 baseline.
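
Before touching the GPU, the same workflow can be rehearsed on the host. The sketch below is an assumption-laden simulation, not the cuBLAS path itself: it quantizes both operands to E4M3 with per-tensor scales, accumulates in Python floats (standing in for fp32 accumulation), applies a dequantizing epilogue, and compares against an unquantized reference. The names `to_e4m3`, `quantize`, `fp8_gemm`, and `max_rel_error` are hypothetical, and the matrix sizes are arbitrary.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite value of the assumed OCP E4M3 format

def to_e4m3(x):
    """Nearest E4M3 value (bias 7, 3 mantissa bits) with saturation to the finite range."""
    if math.isnan(x):
        return math.nan
    a = abs(x)
    if a >= E4M3_MAX:
        return math.copysign(E4M3_MAX, x)
    if a == 0.0:
        return 0.0
    exp = max(math.frexp(a)[1] - 1, -6)   # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)
    return math.copysign(round(a / step) * step, x)

def quantize(mat):
    """Per-tensor scaling: map the largest magnitude in the matrix onto E4M3_MAX."""
    amax = max(abs(v) for row in mat for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [[to_e4m3(v / scale) for v in row] for row in mat], scale

def matmul(a, b):
    """Plain triple-loop matrix product; Python floats stand in for the wide accumulator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def fp8_gemm(a, b):
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = matmul(qa, qb)                                # accumulate quantized products
    return [[v * sa * sb for v in row] for row in acc]  # epilogue: undo both scales

def max_rel_error(out, ref):
    """Largest absolute error, normalized by the largest reference magnitude."""
    ref_max = max(abs(v) for row in ref for v in row)
    err = max(abs(o - r) for ro, rr in zip(out, ref) for o, r in zip(ro, rr))
    return err / ref_max

random.seed(0)
M, K, N = 8, 16, 8
A = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]
B = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

ref = matmul(A, B)      # unquantized baseline
out = fp8_gemm(A, B)    # simulated FP8 path
err = max_rel_error(out, ref)
print(f"max error relative to largest reference entry: {err:.4f}")
```

The acceptable error threshold depends on the model and layer, so any tolerance applied to `err` is a project-specific choice rather than a fixed rule.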

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: Write an FP8 Kernel
Download link: https://github.com/tensormux/kernel-skills/archive/main.zip#write-an-fp8-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.