gluon-gpr-opt
Reduce GPR usage, accelerate Gluon GEMM.
System Documentation
What problem does it solve?
General-purpose register (GPR) pressure and LDS- and MFMA-bound stalls limit the performance of Gluon GEMM kernels on CDNA3 and CDNA4 GPUs. This skill provides a structured, two-stage approach to reducing GPR pressure and improving pipeline overlap. Stage 1 unrolls the loop ×2 to eliminate the k%2 modulo overhead. Stage 2 applies N-slice tiling, splitting the B tile into left and right halves so that data movement and computation can proceed concurrently. Supported targets are gfx942 (MI300X, MI308X, MI325X; CDNA3) and gfx950 (MI350; CDNA4).
Core Features & Use Cases
- Stage 1: Loop Unroll ×2 eliminates the k%2 modulo overhead by hardcoding g_idx and l_idx, enabling the compiler to resolve LDS slot accesses statically.
- Stage 2: N-Slice splits the B tile into halves to hide DMA/loads behind MFMA, reducing pipeline bubbles on CDNA4 and providing alternate paths on CDNA3.
- Use cases include large BLOCK_N and MFMA-heavy configurations where maximizing throughput and hiding latencies is critical.
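The Stage 1 idea can be illustrated with a small host-side model in plain Python. This is a sketch rather than Gluon kernel code, and it rests on assumptions: `l_idx` is taken to be the LDS slot read by the current iteration, `g_idx` the slot filled for the next one, and the function names are hypothetical. It shows that a ×2 unrolled loop with literal slot indices visits the same slots as the `k % 2` version.

```python
def slots_with_modulo(num_k):
    """Baseline: both slot indices are recomputed with a modulo every iteration."""
    trace = []
    for k in range(num_k):
        l_idx = k % 2          # assumed: LDS slot consumed in this iteration
        g_idx = (k + 1) % 2    # assumed: LDS slot filled for the next iteration
        trace.append((k, g_idx, l_idx))
    return trace


def slots_unrolled_x2(num_k):
    """Unrolled x2: each half of the loop body uses hardcoded slot indices."""
    trace = []
    k = 0
    while k + 1 < num_k:
        trace.append((k, 1, 0))      # even iteration: g_idx=1, l_idx=0
        trace.append((k + 1, 0, 1))  # odd iteration:  g_idx=0, l_idx=1
        k += 2
    if k < num_k:
        # Tail iteration when num_k is odd; k is even here.
        trace.append((k, 1, 0))
    return trace


# Both schedules touch the same slots in the same order.
for n in range(9):
    assert slots_with_modulo(n) == slots_unrolled_x2(n)
```

Because the indices in the unrolled body are literals, a compiler no longer has to carry a runtime value for the slot selection, which is the property Stage 1 relies on.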
Quick Start
Apply Stage 1 loop unroll ×2 first; if residual bubbles remain, enable Stage 2 N-slice tiling and validate improvements.
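Before measuring any speed-up from Stage 2, it can help to confirm that splitting the B tile does not change the result. The following is a minimal reference sketch in plain Python, not the kernel itself; `matmul` and `matmul_n_sliced` are hypothetical helper names, and the split point at half of N is an assumption based on the left/right description above.

```python
def matmul(a, b):
    """Reference GEMM on nested lists: a is M x K, b is K x N."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def matmul_n_sliced(a, b):
    """Compute A x B as two independent products over the left and right halves of B."""
    half = len(b[0]) // 2
    b_left = [row[:half] for row in b]
    b_right = [row[half:] for row in b]
    c_left = matmul(a, b_left)    # independent of the right half
    c_right = matmul(a, b_right)  # independent of the left half
    # Concatenate the two output halves along N.
    return [left + right for left, right in zip(c_left, c_right)]


a = [[1, 2], [3, 4]]
b = [[1, 0, 2, 1], [0, 1, 1, 3]]
assert matmul_n_sliced(a, b) == matmul(a, b)
```

The two half-products share A but have no dependency on each other, which is what would allow a kernel to load one half of B while the MFMA work on the other half is still in flight.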
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: gluon-gpr-opt
Download link: https://github.com/leonling-ll/claude-skills/archive/main.zip#gluon-gpr-opt
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.