Write an FP8 Kernel

Author: tensormux

Official

FP8 kernel design for Hopper/Ada GPUs.

Software Engineering #gpu #kernel #quantization #fp8 #fp8-kernel #wgmma

Version: 1.0.0

System Documentation

What problem does it solve?

This Skill guides architects and engineers through designing and implementing FP8 compute kernels for inference and training on NVIDIA Hopper (sm_90) and Ada Lovelace (sm_89) hardware. It covers FP8 format selection, scaling strategy, tensor core usage via WGMMA or cuBLAS, and dequantization epilogue design.

Core Features & Use Cases

FP8 format mapping and per-tensor / per-channel scaling strategies for weights and activations, including E4M3 for forward paths and E5M2 for gradients.
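Choice of implementation path (cuBLAS FP8, CUTLASS FP8, or custom WGMMA) with guidance on when to prefer each. Note that WGMMA instructions are available on Hopper only, so a custom WGMMA kernel does not run on Ada.
Epilogue design that accumulates the GEMM in fp32, applies the dequantization scales to the accumulator, and writes fp16/bf16/fp32 output, with saturating (satfinite) conversion wherever values are cast to FP8.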
Validation and benchmarking workflow to compare FP8 results against fp32 references and monitor overflow conditions.
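
The format and scaling choices above can be illustrated with a small CPU-side simulation. This is a minimal sketch, not the Skill's own code: the helper names (round_fp8, per_tensor_scale, per_channel_scales) are hypothetical, the format constants assume the common OCP FP8 definitions (E4M3 with a largest finite value of 448, E5M2 with 57344), and the clamp only mimics satfinite-style conversion in software.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value (assumed OCP FP8 definition)
E5M2_MAX = 57344.0   # largest finite E5M2 value (assumed OCP FP8 definition)

def round_fp8(x, mant_bits=3, min_exp=-6, max_val=E4M3_MAX):
    """Round x onto a simulated FP8 grid, saturating at +/- max_val."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_val)                 # satfinite-style clamp
    _, e = math.frexp(a)                     # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)                # clamp exponent in the subnormal range
    step = 2.0 ** (exp - mant_bits)          # spacing of representable values
    q = round(a / step) * step               # Python round() is half-to-even
    return sign * min(q, max_val)

def per_tensor_scale(values, max_val=E4M3_MAX):
    """One scale for the whole tensor: map the largest magnitude to max_val."""
    amax = max(abs(v) for v in values)
    return amax / max_val if amax > 0 else 1.0

def per_channel_scales(rows, max_val=E4M3_MAX):
    """One scale per row (channel) of a 2-D weight given as a list of lists."""
    return [per_tensor_scale(r, max_val) for r in rows]

# Forward path: E4M3 with a per-tensor scale.
w = [0.02, -1.5, 3.2, 0.75]
s = per_tensor_scale(w)
q = [round_fp8(v / s) for v in w]
w_hat = [v * s for v in q]                   # dequantized approximation of w

# Gradient path: E5M2 trades one mantissa bit for a wider exponent range.
g = round_fp8(1.1, mant_bits=2, min_exp=-14, max_val=E5M2_MAX)
```

Per-channel scaling costs one extra multiply per output channel in the epilogue, but it keeps one outlier channel from forcing a coarse scale onto the whole tensor.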

Quick Start

Implement a small FP8 GEMM: select E4M3 for the inputs, apply per-tensor scaling, run the multiplication through the cuBLAS FP8 path, and verify the result against an fp32 baseline.
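
Before touching the GPU, the numerics of this workflow can be checked on the CPU. The sketch below is an assumption-laden stand-in, not the cuBLAS path itself: to_e4m3, quantize_matrix, and fp8_gemm are hypothetical helpers, E4M3 is simulated in software with a largest finite value of 448, and Python floats stand in for both the wide accumulator and the reference precision.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite E4M3 value (assumed OCP FP8 definition)

def to_e4m3(x):
    """Saturating round-to-nearest-even onto a simulated E4M3 grid."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)                    # clamp instead of overflowing
    exp = max(math.frexp(a)[1] - 1, -6)          # clamp exponent for subnormals
    step = 2.0 ** (exp - 3)                      # 3 mantissa bits
    return math.copysign(round(a / step) * step, x)

def quantize_matrix(m):
    """Per-tensor scaling: map the largest magnitude onto E4M3_MAX."""
    amax = max(abs(v) for row in m for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [[to_e4m3(v / scale) for v in row] for row in m], scale

def matmul(a, b):
    """Plain reference GEMM on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def fp8_gemm(a, b):
    """Quantize both inputs, accumulate wide, dequantize in the epilogue."""
    qa, sa = quantize_matrix(a)
    qb, sb = quantize_matrix(b)
    acc = matmul(qa, qb)                         # stands in for fp32 accumulation
    return [[v * sa * sb for v in row] for row in acc]

random.seed(0)
M, K, N = 8, 32, 8
A = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]
B = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

ref = matmul(A, B)                               # unquantized baseline
out = fp8_gemm(A, B)

max_abs_err = max(abs(o - r) for ro, rr in zip(out, ref) for o, r in zip(ro, rr))
ref_scale = max(abs(r) for row in ref for r in row)
rel_err = max_abs_err / ref_scale                # error relative to the largest output
```

Reporting the error relative to the largest reference output avoids blowing up on near-zero entries. The same comparison can be applied to the GPU result once the kernel is in place, alongside a count of inputs that hit the saturation clamp.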
