Write CUDA Softmax Kernel

Community

Numerically stable, high-performance CUDA softmax

AuthorKrxGu
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Implements a correct, numerically stable, and high-performance CUDA softmax along the last dimension for 2D/3D tensors, preventing overflow/underflow, handling masked rows, and supporting fp16/bf16 inputs with fp32 accumulation to avoid common softmax correctness bugs in attention and kernel-fusion scenarios.

Core Features & Use Cases

  • Numerical stability: always subtracts the row maximum before exponentiation and accumulates in fp32 for low-precision inputs.
  • Masked softmax: supports additive and boolean masks with explicit handling for fully-masked rows and documented fallback behavior.
  • Decomposition & performance: guidance for warp/block assignment, warp-level reductions, optional online (single-pass) or two-pass algorithms, and host dispatch parameters for common attention shapes.
  • Precision & backward pass: fp16/bf16 input handling with fp32 accumulation and a reusable reduction pattern for the backward gradient computation.
  • Use Case: fuse softmax into attention kernels for low memory footprint flash-attention patterns or implement a standalone masked softmax when library routines are insufficient.

Quick Start

Ask the agent to generate a CUDA kernel that computes numerically stable row-wise softmax with optional additive or boolean masking, using fp32 accumulation for fp16/bf16 inputs and warp+block reductions for performance.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: Write CUDA Softmax Kernel
Download link: https://github.com/KrxGu/kernel-skills/archive/main.zip#write-cuda-softmax-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.