Write a Triton Softmax Kernel
Community
Stable, high-performance Triton softmax
Category: Software Engineering
Tags: #softmax #triton #numerical-stability #gpu-kernel #kernel-optimization #masked-softmax
Author: KrxGu
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Implements a numerically stable, single-program-per-row Triton softmax kernel that avoids separate reduction passes, handles masked attention, and prevents overflow/precision loss during exponentiation and accumulation.
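The overflow protection mentioned above comes from subtracting the row max before exponentiating. The following is a minimal CPU sketch in plain Python, not the skill's Triton kernel; the function name `stable_softmax` is hypothetical and chosen only for illustration.

```python
import math

def stable_softmax(row):
    """Softmax over one row, shifted by the row max before exponentiation."""
    row_max = max(row)
    # After the shift every exponent is <= 0, so exp() stays in (0, 1].
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)  # at least 1.0, because the max element maps to exp(0)
    return [e / total for e in exps]

# math.exp(1000.0) on its own would raise OverflowError; the shifted form does not.
probs = stable_softmax([1000.0, 1000.0, 999.0])
```

Python floats are double precision. A Triton kernel would presumably get the analogous safety margin by casting fp16/bf16 inputs to fp32 before the max and sum reductions, as the feature list describes.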
Core Features & Use Cases
- Numerical stability: casts inputs to fp32 for reductions, subtracts the row max before exponentiation, and fills out-of-bounds lanes with -inf on load so that padding contributes nothing to the max or the sum.
- Masked and fused softmax: applies additive masks before reduction and supports fusion with downstream elementwise operations (e.g., scale or multiply-by-V).
- Multi-block online softmax: handles rows larger than BLOCK_SIZE with running-max/rescaled-sum updates or multi-pass strategies and documents correctness/performance tradeoffs.
- Use Case: Replace a failing or unfused softmax in attention kernels with a Triton implementation that preserves correctness for large rows and masked sequences while remaining bandwidth-efficient.
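The masking, -inf padding, and running-max/rescaled-sum ideas listed above can be sketched together on the CPU. This is an illustrative plain-Python reference under stated assumptions, not the skill's kernel: `online_softmax`, `load_block`, and the `block_size` parameter are hypothetical names, and returning zeros for a fully masked row is one possible convention (other implementations may produce NaN instead).

```python
import math

NEG_INF = float("-inf")

def online_softmax(row, mask=None, block_size=4):
    """Row softmax computed block by block with a running max and rescaled sum."""
    n = len(row)
    mask = [0.0] * n if mask is None else mask

    def load_block(start):
        # Mimics a masked load: the additive mask is applied before any
        # reduction, and lanes past the end of the row read as -inf.
        return [row[i] + mask[i] if i < n else NEG_INF
                for i in range(start, start + block_size)]

    running_max, running_sum = NEG_INF, 0.0
    for start in range(0, n, block_size):
        block = load_block(start)
        new_max = max(running_max, max(block))
        if new_max == NEG_INF:
            continue  # everything seen so far is masked; avoids (-inf) - (-inf)
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in block)
        running_max = new_max

    if running_sum == 0.0:
        return [0.0] * n  # assumption: a fully masked row yields all zeros
    # Second pass: normalize with the final max and sum.
    return [math.exp(row[i] + mask[i] - running_max) / running_sum
            for i in range(n)]

row = [0.5, 1000.0, -3.0, 2.0, 999.0, 7.0, 1.5]
mask = [0.0, 0.0, 0.0, NEG_INF, 0.0, 0.0, NEG_INF]
out = online_softmax(row, mask, block_size=4)
```

The sketch reads the row twice (one pass for the statistics, one to normalize), which is the kind of correctness/performance tradeoff the feature list refers to: a kernel whose row fits in a single block can keep the values in registers and skip the second read.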
Quick Start
Implement a Triton row-wise softmax kernel that uses fp32 accumulation, applies additive masks before max reduction, handles rows larger than BLOCK_SIZE with online max/sum updates, and validates outputs against torch.nn.functional.softmax.
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: Write a Triton Softmax Kernel
Download link: https://github.com/KrxGu/kernel-skills/archive/main.zip#write-a-triton-softmax-kernel
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.