rocm-kernels
Community
Benchmark and optimize Triton for ROCm
Author: danielfleischer
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill helps engineers build and validate optimized Triton kernels for AMD GPUs on ROCm so they can speed up diffusion and transformer pipelines without manual low-level debugging.
Core Features & Use Cases
- Kernel optimization guidance: RMSNorm, RoPE 3D, GEGLU, and AdaLN patterns tuned for MI355X (gfx950) and R9700 (gfx1201), including ROCm-specific constraints.
- Correctness guardrails: concrete pitfalls such as incorrect BLOCK_D handling, RoPE cos/sin indexing that crashes for batch > 1, and math functions that are missing on ROCm, with safe replacements.
- Integration + benchmarking: diffusers injection patterns for LTX-Video (and notes for transformers/Hub workflows) plus micro-benchmarks and end-to-end pipeline timing.
- Use case: You're targeting LTX-Video on ROCm and want faster RMSNorm and attention-related execution. You inject custom Triton kernels, then confirm the speedups with repeatable benchmark artifacts.
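The listing does not spell out what "incorrect BLOCK_D handling" means, so the following is only a sketch of one common form of it, under stated assumptions: the kernel processes one row per program, pads the row width D up to a power-of-two block (Triton's `tl.arange` needs a power-of-two extent), and masks the extra lanes. The function names here are hypothetical and the code is plain Python rather than Triton, so it can serve as a CPU-side reference when validating a real kernel.

```python
import math


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def rmsnorm_reference(x, weight, eps=1e-6):
    """Plain RMSNorm over one row: x * w / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rmsnorm_blocked(x, weight, eps=1e-6):
    """Mimics a one-row-per-program kernel with a padded, masked block.

    Assumption: BLOCK_D is the row width rounded up to a power of two.
    """
    d = len(x)
    block_d = next_power_of_2(d)
    mask = [i < d for i in range(block_d)]
    # Masked-out lanes load 0.0 so they do not disturb the sum of squares.
    xs = [x[i] if mask[i] else 0.0 for i in range(block_d)]
    ws = [weight[i] if mask[i] else 0.0 for i in range(block_d)]
    # Divide by the true width d. Dividing by block_d would silently
    # shrink the mean whenever d is not a power of two.
    mean_sq = sum(v * v for v in xs) / d
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Only in-bounds lanes are "stored" back.
    return [xs[i] * inv_rms * ws[i] for i in range(block_d) if mask[i]]


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two rows."""
    return max(abs(p - q) for p, q in zip(a, b))
```

Comparing `rmsnorm_blocked` against `rmsnorm_reference` on a width that is not a power of two (for example 5) is a cheap way to catch a padding or masking mistake before moving to GPU runs.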
Quick Start
Run the end-to-end LTX-Video pipeline benchmark in Triton mode with:
python scripts/benchmark_e2e.py --mode triton --num-frames 25 --steps 30 --warmup 1 --output-dir examples/ltx-video-benchmark
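For micro-benchmarks outside the script, a warmup-then-measure loop along the following lines may be useful. This is a minimal sketch, not the implementation of `scripts/benchmark_e2e.py`; the `benchmark` and `speedup` helpers and their parameters are hypothetical. On a GPU the `sync` argument would typically be `torch.cuda.synchronize` (which PyTorch's ROCm builds also expose), so that queued kernels finish before the clock is read.

```python
import statistics
import time


def benchmark(fn, warmup=1, iters=10, sync=None):
    """Return the median wall-clock seconds of fn() after warmup runs.

    sync is an optional zero-argument callable invoked before each clock
    read; pass a device synchronize function when timing GPU work.
    """
    for _ in range(warmup):
        fn()  # warmup runs absorb compilation and cache effects
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s, candidate_s):
    """Ratio > 1.0 means the candidate is faster than the baseline."""
    return baseline_s / candidate_s
```

The median is used instead of the mean so that a single slow iteration does not skew the reported time.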
Dependency Matrix
Required Modules
torch, triton, diffusers, transformers, accelerate, safetensors, huggingface-hub, imageio, imageio-ffmpeg, kernels
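Before running anything, it can help to check which of the required modules are importable in the current environment. The sketch below uses only the standard library; the mapping from package name to import name (for example `huggingface-hub` to `huggingface_hub`) is an assumption based on the usual naming of these packages, and `missing_modules` is a hypothetical helper, not part of the Skill.

```python
import importlib.util

# Package name as listed above -> assumed top-level import name.
REQUIRED = {
    "torch": "torch",
    "triton": "triton",
    "diffusers": "diffusers",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "safetensors": "safetensors",
    "huggingface-hub": "huggingface_hub",
    "imageio": "imageio",
    "imageio-ffmpeg": "imageio_ffmpeg",
    "kernels": "kernels",
}


def missing_modules(required=REQUIRED):
    """Return the package names whose import name cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only confirms that each module can be located; it does not verify versions or that the installed `torch` build targets ROCm.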
Components
scripts, references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: rocm-kernels
Download link: https://github.com/danielfleischer/kernels2/archive/main.zip#rocm-kernels
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.