Write a Triton Softmax Kernel

Community

Stable, high-performance Triton softmax

Author: KrxGu
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Implements a numerically stable, single-program-per-row Triton softmax kernel that avoids separate reduction passes, handles masked attention, and prevents overflow/precision loss during exponentiation and accumulation.
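As a rough illustration of the stability idea described above, here is a minimal CPU sketch in plain Python that models what one program does for one row that fits in a single block. It is not the skill's kernel: the function name `softmax_row_padded` and the `block_size` parameter are hypothetical, and Python floats are fp64 rather than the fp32 a kernel would typically accumulate in.

```python
import math

NEG_INF = float("-inf")

def softmax_row_padded(row, block_size):
    """CPU model of one program handling one row that fits in one block.

    Hypothetical helper for illustration only; not part of the skill.
    """
    assert len(row) <= block_size, "this sketch assumes the row fits in one block"
    # Lanes past the end of the row are filled with -inf, mimicking a masked
    # load whose fill value is -inf. exp(-inf) is 0, so padding adds nothing.
    lanes = [float(v) for v in row] + [NEG_INF] * (block_size - len(row))
    row_max = max(lanes)
    # Subtracting the row max makes the largest exponent exactly 0,
    # so exp() cannot overflow even for very large logits.
    exps = [math.exp(v - row_max) for v in lanes]
    denom = sum(exps)
    # Only the in-bounds lanes are written back.
    return [e / denom for e in exps[: len(row)]]

# Logits this large would overflow a naive exp(x) / sum(exp(x)).
probs = softmax_row_padded([1000.0, 1001.0, 1002.0], block_size=8)
```

Padding with 0 instead of -inf would add `exp(0 - row_max)` per padded lane to the denominator, which is the kind of phantom contribution the description warns about.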

Core Features & Use Cases

  • Numerical stability: casts inputs to fp32 for reductions, subtracts the row max before exponentiation, and fills out-of-bounds lanes with -inf on load so that padding contributes nothing to the max or the sum.
  • Masked and fused softmax: applies additive masks before reduction and supports fusion with downstream elementwise operations (e.g., scale or multiply-by-V).
  • Multi-block online softmax: handles rows larger than BLOCK_SIZE with running-max/rescaled-sum updates or multi-pass strategies and documents correctness/performance tradeoffs.
  • Use Case: Replace a failing or unfused softmax in attention kernels with a Triton implementation that preserves correctness for large rows and masked sequences while remaining bandwidth-efficient.
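The running-max/rescaled-sum update mentioned in the features above can be sketched on the CPU as follows. This is a plain-Python reference under stated assumptions, not the skill's Triton code: `online_softmax_row`, `additive_mask`, and `block_size` are hypothetical names, the mask is assumed to hold 0 for kept positions and -inf for masked ones, and the normalization is done in a second pass over the row.

```python
import math

NEG_INF = float("-inf")

def online_softmax_row(scores, additive_mask, block_size):
    """Softmax of one row processed block by block (hypothetical sketch)."""
    running_max = NEG_INF
    running_sum = 0.0
    n = len(scores)

    # Pass 1: sweep the row in blocks, keeping a running max and a sum
    # that is always expressed relative to the current running max.
    for start in range(0, n, block_size):
        stop = start + block_size
        # The additive mask is applied before the max reduction.
        block = [s + m for s, m in zip(scores[start:stop], additive_mask[start:stop])]
        new_max = max(running_max, max(block))
        if new_max == NEG_INF:
            # Everything seen so far is masked; skipping avoids computing
            # exp(-inf - (-inf)), which would be NaN.
            continue
        # Rescale the old sum to the new max, then add this block's terms.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in block)
        running_max = new_max

    if running_sum == 0.0:
        raise ValueError("row is fully masked; softmax is undefined")

    # Pass 2: normalize with the final max and sum.
    return [math.exp(s + m - running_max) / running_sum
            for s, m in zip(scores, additive_mask)]

scores = [0.5, 2.0, -1.0, 3.0, 1.5]
mask = [0.0, NEG_INF, 0.0, 0.0, 0.0]  # second position is masked out
out = online_softmax_row(scores, mask, block_size=2)
```

The tradeoff is that the row is read twice (once for the statistics, once for normalization) in exchange for never holding the whole row at once. How a fully masked row should be handled is a design decision; raising an error here is only one option.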

Quick Start

Implement a Triton row-wise softmax kernel that uses fp32 accumulation, applies additive masks before the max reduction, and handles rows larger than BLOCK_SIZE with online max/sum updates. Validate its outputs against torch.nn.functional.softmax.

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: Write a Triton Softmax Kernel
Download link: https://github.com/KrxGu/kernel-skills/archive/main.zip#write-a-triton-softmax-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.