rocm-kernels

Name: rocm-kernels
Author: danielfleischer

Community

Benchmark and optimize Triton for ROCm

Software Engineering #benchmarking #triton #diffusers #rmsnorm #rocm #amd-gpu #kernel-injection


Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill helps engineers build and validate optimized Triton kernels for AMD GPUs on ROCm so they can speed up diffusion and transformer pipelines without manual low-level debugging.

Core Features & Use Cases

Kernel optimization guidance: RMSNorm, RoPE 3D, GEGLU, and AdaLN patterns tuned for MI355X (gfx950) and R9700 (gfx1201), including ROCm-specific constraints.
Correctness guardrails: concrete pitfalls such as incorrect BLOCK_D handling, RoPE cos/sin indexing that crashes for batch > 1, and math functions that are missing on ROCm, each with a safe replacement.
Integration + benchmarking: diffusers injection patterns for LTX-Video (and notes for transformers/Hub workflows) plus micro-benchmarks and end-to-end pipeline timing.
Use case: You’re targeting LTX-Video on ROCm and want faster RMSNorm and attention-related execution by injecting custom Triton kernels, and you want to confirm the speedups with repeatable benchmark artifacts.
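As an illustration of the correctness-guardrail idea above, here is a minimal sketch of a reference RMSNorm that a custom kernel's output could be compared against. It is written in plain Python so it runs anywhere. The function names `rmsnorm_reference` and `max_abs_diff` and the default `eps` are assumptions made for this example and are not taken from the Skill's scripts.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: y_i = x_i / sqrt(mean(x^2) + eps) * w_i.

    Hypothetical helper for illustration; not part of the Skill's code.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length rows."""
    return max(abs(p - q) for p, q in zip(a, b))


# Example: normalize a single row with unit weights.
row = [3.0, 4.0]
out = rmsnorm_reference(row, [1.0, 1.0], eps=0.0)
```

In practice the same comparison would likely be done on GPU tensors, checking the kernel output against a framework-level reference within a tolerance suited to the dtype.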

Quick Start

Run the end-to-end LTX-Video pipeline benchmark in Triton mode with:

    python scripts/benchmark_e2e.py --mode triton --num-frames 25 --steps 30 --warmup 1 --output-dir examples/ltx-video-benchmark
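The `--warmup` flag in the command above suggests that warm-up runs are kept out of the measured timings. A minimal sketch of that pattern follows. The helper `time_callable` and its parameters are assumptions made for this example and do not describe how `scripts/benchmark_e2e.py` is actually implemented.

```python
import statistics
import time


def time_callable(fn, warmup=1, iters=10, sync=None):
    """Time fn() over `iters` runs after `warmup` untimed runs.

    `sync` is an optional callable that blocks until device work has
    finished; it is a hypothetical hook for GPU workloads.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time costs such as compilation or caching
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure asynchronous work is included in the sample
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "iters": iters,
    }


# Example with a CPU-only stand-in workload.
calls = []
result = time_callable(lambda: calls.append(sum(range(1000))), warmup=1, iters=5)
```

For GPU kernels, launches are asynchronous, so a synchronization call (for example `torch.cuda.synchronize`, which ROCm builds of PyTorch also expose) would need to be passed as `sync` for the samples to be meaningful.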

Dependency Matrix

Required Modules

torch, triton, diffusers, transformers, accelerate, safetensors, huggingface-hub, imageio, imageio-ffmpeg, kernels
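Before running the benchmark, it can help to check that the modules listed above are importable. The sketch below uses only the standard library. The mapping from package names to import names (for example `huggingface-hub` to `huggingface_hub`) reflects the usual packaging convention and is an assumption here, as is the helper name `missing_modules`.

```python
import importlib.util

# Package name -> import name (assumed from common packaging conventions).
REQUIRED = {
    "torch": "torch",
    "triton": "triton",
    "diffusers": "diffusers",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "safetensors": "safetensors",
    "huggingface-hub": "huggingface_hub",
    "imageio": "imageio",
    "imageio-ffmpeg": "imageio_ffmpeg",
    "kernels": "kernels",
}


def missing_modules(required=REQUIRED):
    """Return the sorted package names whose import name cannot be located."""
    return sorted(
        pkg
        for pkg, mod in required.items()
        if importlib.util.find_spec(mod) is None
    )


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be found. It does not verify versions or that the installed torch build targets ROCm.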

Components

scripts, references