Write CUDA Softmax Kernel
Community
Numerically stable, high-performance CUDA softmax
System Documentation
What problem does it solve?
Implements a correct, numerically stable, high-performance CUDA softmax along the last dimension of 2D/3D tensors. It prevents overflow and underflow, handles masked rows, and supports fp16/bf16 inputs with fp32 accumulation, avoiding common softmax correctness bugs in attention and kernel-fusion scenarios.
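The stability idea can be sketched as a small kernel. This is a minimal illustration, not the skill's actual output: the names `softmax_rows_fp16` and `softmax_host_launch` are hypothetical, and it assumes one warp per row, a block size that is a multiple of 32, and a three-pass structure (max, sum, normalize). Inputs are fp16, while the maximum and the sum are held in fp32.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical kernel: one warp (32 threads) computes softmax over one row.
__global__ void softmax_rows_fp16(const __half* in, __half* out, int rows, int cols) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;  // the whole warp exits together, so shuffles below stay full-warp

    const __half* x = in  + (size_t)row * cols;
    __half*       y = out + (size_t)row * cols;

    // Pass 1: row maximum, held in fp32.
    float m = -FLT_MAX;
    for (int c = lane; c < cols; c += 32) m = fmaxf(m, __half2float(x[c]));
    for (int off = 16; off > 0; off >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, off));

    // Pass 2: sum of exp(x - max), accumulated in fp32. Every exponent is <= 0,
    // so expf cannot overflow, and the maximum element contributes exactly 1.
    float s = 0.0f;
    for (int c = lane; c < cols; c += 32) s += expf(__half2float(x[c]) - m);
    for (int off = 16; off > 0; off >>= 1)
        s += __shfl_xor_sync(0xffffffffu, s, off);

    // Pass 3: normalize and convert back to fp16 only at the final store.
    float inv = 1.0f / s;
    for (int c = lane; c < cols; c += 32)
        y[c] = __float2half(expf(__half2float(x[c]) - m) * inv);
}

// Hypothetical host helper: copies data to the device, launches, copies back.
std::vector<float> softmax_host_launch(const std::vector<float>& h_in, int rows, int cols) {
    size_t n = (size_t)rows * cols;
    std::vector<__half> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = __float2half(h_in[i]);

    __half *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(__half));
    cudaMalloc(&d_out, n * sizeof(__half));
    cudaMemcpy(d_in, h.data(), n * sizeof(__half), cudaMemcpyHostToDevice);

    const int warps_per_block = 4;  // 128 threads, i.e. 4 rows per block
    const int blocks = (rows + warps_per_block - 1) / warps_per_block;
    softmax_rows_fp16<<<blocks, warps_per_block * 32>>>(d_in, d_out, rows, cols);

    cudaMemcpy(h.data(), d_out, n * sizeof(__half), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = __half2float(h[i]);
    return out;
}
```

A row of four equal values near the top of the fp16 range (for example 60000) would overflow fp32 if exponentiated directly, whereas subtracting the row maximum first reduces every exponent to zero and yields 0.25 per element. This sketch does not guard against rows consisting entirely of -inf; that case belongs to the masked-row handling.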
Core Features & Use Cases
- Numerical stability: always subtracts the row maximum before exponentiation and accumulates in fp32 for low-precision inputs.
- Masked softmax: supports additive and boolean masks with explicit handling for fully-masked rows and documented fallback behavior.
- Decomposition & performance: guidance for warp/block assignment, warp-level reductions, optional online (single-pass) or two-pass algorithms, and host dispatch parameters for common attention shapes.
- Precision & backward pass: fp16/bf16 input handling with fp32 accumulation and a reusable reduction pattern for the backward gradient computation.
- Use Case: fuse softmax into attention kernels for FlashAttention-style patterns with a low memory footprint, or implement a standalone masked softmax when library routines are insufficient.
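The masking and online (single-pass) features listed above can be sketched together. The following is an illustrative example under stated assumptions, not the skill's generated code: `MaxSum`, `merge`, `masked_online_softmax`, and `masked_softmax_host` are hypothetical names, the mask is a boolean byte array where nonzero means "keep", inputs are fp32 for brevity, one warp handles one row, and the fallback chosen for a fully masked row is to write zeros (other conventions exist).

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Running (max, sum) pair for online softmax; s is the sum of exp(x - m).
struct MaxSum { float m; float s; };

// Merge two partial results. The guard keeps (-inf, 0) a safe identity:
// without it, (-inf) - (-inf) would produce NaN.
__device__ MaxSum merge(MaxSum a, MaxSum b) {
    float m = fmaxf(a.m, b.m);
    if (m == -INFINITY) return {m, 0.0f};
    return {m, a.s * expf(a.m - m) + b.s * expf(b.m - m)};
}

// keep[i] != 0 means position i participates; 0 means it is masked out.
__global__ void masked_online_softmax(const float* in, const unsigned char* keep,
                                      float* out, int rows, int cols) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;  // uniform per warp

    const float*         x = in   + (size_t)row * cols;
    const unsigned char* k = keep + (size_t)row * cols;
    float*               y = out  + (size_t)row * cols;

    // One read pass: each lane folds its unmasked elements into (m, s).
    MaxSum acc = {-INFINITY, 0.0f};
    for (int c = lane; c < cols; c += 32)
        if (k[c]) acc = merge(acc, MaxSum{x[c], 1.0f});

    // Warp-level butterfly reduction reusing the same merge rule.
    for (int off = 16; off > 0; off >>= 1) {
        MaxSum other = {__shfl_xor_sync(0xffffffffu, acc.m, off),
                        __shfl_xor_sync(0xffffffffu, acc.s, off)};
        acc = merge(acc, other);
    }

    // Fully masked row: s == 0, so avoid 0/0 and write zeros instead.
    bool  live = acc.s > 0.0f;
    float inv  = live ? 1.0f / acc.s : 0.0f;
    for (int c = lane; c < cols; c += 32)
        y[c] = (live && k[c]) ? expf(x[c] - acc.m) * inv : 0.0f;
}

// Hypothetical host helper used to exercise the kernel.
std::vector<float> masked_softmax_host(const std::vector<float>& x,
                                       const std::vector<unsigned char>& keep,
                                       int rows, int cols) {
    size_t n = (size_t)rows * cols;
    float *d_in = nullptr, *d_out = nullptr;
    unsigned char* d_keep = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_keep, n);
    cudaMemcpy(d_in, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_keep, keep.data(), n, cudaMemcpyHostToDevice);

    const int warps_per_block = 4;
    const int blocks = (rows + warps_per_block - 1) / warps_per_block;
    masked_online_softmax<<<blocks, warps_per_block * 32>>>(d_in, d_keep, d_out, rows, cols);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_keep);
    return out;
}
```

Compared with a two-pass design (one pass for the maximum, one for the sum), the online form reads the inputs once for statistics and once more for the write. The same `merge` rule is what lets partial results from different lanes, warps, or attention tiles be combined in any order.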
Quick Start
Ask the agent to generate a CUDA kernel that computes numerically stable row-wise softmax with optional additive or boolean masking, using fp32 accumulation for fp16/bf16 inputs and warp+block reductions for performance.
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: Write CUDA Softmax Kernel
Download link: https://github.com/KrxGu/kernel-skills/archive/main.zip#write-cuda-softmax-kernel
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.