rocm-kernels

Community

Benchmark and optimize Triton for ROCm

Author: danielfleischer
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill helps engineers build and validate optimized Triton kernels for AMD GPUs on ROCm so they can speed up diffusion and transformer pipelines without manual low-level debugging.

Core Features & Use Cases

  • Kernel optimization guidance: RMSNorm, RoPE 3D, GEGLU, and AdaLN patterns tuned for MI355X (gfx950) and R9700 (gfx1201), including ROCm-specific constraints.
  • Correctness guardrails: concrete pitfalls like wrong BLOCK_D handling, RoPE cos/sin indexing that crashes for batch>1, and ROCm-missing math functions with safe replacements.
  • Integration + benchmarking: diffusers injection patterns for LTX-Video (and notes for transformers/Hub workflows) plus micro-benchmarks and end-to-end pipeline timing.
  • Use case: You’re targeting LTX-Video on ROCm and want faster RMSNorm and attention-related execution. Inject the custom Triton kernels, then confirm the speedup with repeatable benchmark artifacts.
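
The BLOCK_D pitfall named in the list above can be illustrated without a GPU. What follows is a minimal pure-Python sketch, not the Skill's actual kernel. It assumes the pitfall takes one plausible form: a kernel pads a row of width D up to a power-of-two BLOCK_D, then divides by BLOCK_D instead of D when computing the mean square. The function names and the `divide_by_block` flag are hypothetical and exist only for this illustration.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """Plain RMSNorm over one row: x * w / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def next_power_of_2(n):
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def rmsnorm_blocked(x, weight, eps=1e-6, divide_by_block=False):
    """Emulate a single-block kernel: BLOCK_D lanes, lanes >= D masked to 0.

    divide_by_block=True reproduces the assumed bug (hypothetical flag):
    the mean is taken over BLOCK_D lanes rather than the D real elements.
    """
    d = len(x)
    block_d = next_power_of_2(d)
    # Masked load: out-of-range lanes contribute 0 to the sum of squares.
    lanes = [x[i] if i < d else 0.0 for i in range(block_d)]
    denom = block_d if divide_by_block else d
    mean_sq = sum(v * v for v in lanes) / denom
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Masked store: only the first D lanes are written back.
    return [lanes[i] * inv_rms * weight[i] for i in range(d)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two rows."""
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    row = [0.5 * i - 3.0 for i in range(48)]  # D=48 is padded to BLOCK_D=64
    w = [1.0] * 48
    ref = rmsnorm_reference(row, w)
    print("correct:", max_abs_diff(ref, rmsnorm_blocked(row, w)))
    print("buggy:  ", max_abs_diff(ref, rmsnorm_blocked(row, w, divide_by_block=True)))
```

Testing against a reference at a hidden size that is not a power of two (48 here) exposes this class of bug. It stays invisible whenever D happens to equal BLOCK_D.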

Quick Start

Run the end-to-end LTX-Video pipeline benchmark in Triton mode with:

    python scripts/benchmark_e2e.py --mode triton --num-frames 25 --steps 30 --warmup 1 --output-dir examples/ltx-video-benchmark
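
The warmup-then-measure pattern implied by the `--warmup` flag can be sketched as follows. This is an illustrative harness and not the contents of `scripts/benchmark_e2e.py`; the `benchmark` and `speedup` helpers and their parameters are assumptions. For GPU work, pass a synchronization callable such as `torch.cuda.synchronize` (which PyTorch also exposes on ROCm builds) as `sync`, so that asynchronous kernel launches are not mistaken for finished work.

```python
import statistics
import time


def benchmark(fn, warmup=1, iters=5, sync=None):
    """Time fn() after `warmup` untimed calls; return per-call stats in seconds.

    sync is an optional zero-argument callable (hypothetical parameter) that
    blocks until queued device work has finished.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as compilation or autotuning
    timings = []
    for _ in range(iters):
        if sync is not None:
            sync()  # drain pending work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the timed work to actually complete
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "runs": len(timings),
    }


def speedup(baseline_s, candidate_s):
    """Ratio > 1.0 means the candidate is faster than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    baseline = benchmark(lambda: sum(i * i for i in range(200_000)))
    candidate = benchmark(lambda: sum(i * i for i in range(100_000)))
    print("speedup:", speedup(baseline["median_s"], candidate["median_s"]))
```

Reporting the median rather than the mean keeps a single slow iteration from skewing the result.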

Dependency Matrix

Required Modules

torch, triton, diffusers, transformers, accelerate, safetensors, huggingface-hub, imageio, imageio-ffmpeg, kernels
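
Before running the benchmarks, it may help to confirm that the required modules are importable. The sketch below uses only the standard library. The `missing_modules` helper is hypothetical, and it assumes each package's import name is its listed name with hyphens replaced by underscores (for example `huggingface-hub` imports as `huggingface_hub`).

```python
import importlib.util

# Package names as listed under Required Modules.
REQUIRED = [
    "torch", "triton", "diffusers", "transformers", "accelerate",
    "safetensors", "huggingface-hub", "imageio", "imageio-ffmpeg", "kernels",
]


def missing_modules(names):
    """Return the listed names whose import name cannot be found."""
    missing = []
    for name in names:
        import_name = name.replace("-", "_")  # assumed naming convention
        if importlib.util.find_spec(import_name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    print("missing:", ", ".join(absent) if absent else "none")
```

This only checks that a module can be located. It does not verify versions or that the installed torch build targets ROCm.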

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: rocm-kernels
Download link: https://github.com/danielfleischer/kernels2/archive/main.zip#rocm-kernels

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
