kernel-benchmark
Community
Benchmark GPU kernels vs PyTorch baselines.
Software Engineering
Tags: performance tuning, gpu, pytorch, triton, cuda events, kernel benchmarking, correctness testing
Author: fmh66
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
You need an objective way to compare a custom GPU kernel implementation (CUDA-C++, CUTLASS, CuTe DSL, or Triton) against known PyTorch baselines, while ensuring correctness before you trust any timing results.
Core Features & Use Cases
- Correctness-first benchmarking: Validates solution outputs against a required PyTorch eager reference before collecting latency numbers.
- Multiple kernel & baseline types: Benchmarks CUDA-C++/CUTLASS shared-library kernels and CuTe DSL/Triton Python implementations, with optional baselines for torch.compile and FlashInfer.
- KernelBench-style timing: Uses CUDA event timing by default with warmup, trial control, and L2 cache thrashing to reduce measurement bias.
- Actionable report output: Writes a benchmark.md summary including timing distribution stats and speedups versus selected baselines.
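The workflow in the list above (check correctness, then time with CUDA events, then summarize) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the skill's actual code: the function names, the tolerances, the 256 MiB scratch buffer used to thrash L2, and the choice of median for the speedup are all hypothetical. The torch calls themselves (torch.allclose, torch.cuda.Event, torch.cuda.synchronize) are standard PyTorch APIs, and the timing helper needs a CUDA device to run.

```python
import statistics


def check_correctness(candidate, reference, rtol=1e-3, atol=1e-3):
    """Return True when the candidate tensor matches the eager reference.

    Tolerances are illustrative assumptions, not the skill's defaults.
    """
    import torch  # imported lazily so the stats helpers work without torch

    return torch.allclose(candidate, reference, rtol=rtol, atol=atol)


def time_with_cuda_events(fn, warmup=5, trials=20, l2_bytes=256 * 1024 * 1024):
    """Time fn() on the current CUDA device; return per-trial latencies in ms."""
    import torch

    # Scratch buffer assumed to be larger than the GPU's L2 cache;
    # overwriting it before each trial evicts previously cached lines.
    scratch = torch.empty(l2_bytes, dtype=torch.int8, device="cuda")

    # Warmup runs absorb one-time costs such as JIT compilation.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(trials):
        scratch.zero_()  # thrash L2 so each trial starts with a cold cache
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait until both events have completed
        times_ms.append(start.elapsed_time(end))  # elapsed time in milliseconds
    return times_ms


def summarize(times_ms):
    """Basic distribution statistics for a list of latencies."""
    return {
        "mean": statistics.mean(times_ms),
        "median": statistics.median(times_ms),
        "std": statistics.pstdev(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }


def speedup(baseline_ms, solution_ms):
    """Ratio of median latencies; values above 1.0 mean the solution is faster."""
    return statistics.median(baseline_ms) / statistics.median(solution_ms)
```

In a harness like this, the timing helper would only be called once check_correctness has returned True, so a fast but wrong kernel never produces a reported speedup.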
Quick Start
Run the benchmark by providing your solution file, a reference module that defines reference(**kwargs), a directory to write benchmark.md, and the kernel dimension/scalar arguments your reference expects.
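To make the reference-module contract concrete, here is a small sketch of how a harness could load a file and forward dimension arguments to its reference(**kwargs) function. The skill's real loader and command-line flags are not shown in this listing, so load_reference, the file name reference.py, and the M and N arguments are hypothetical. The stand-in reference body uses plain arithmetic so the example runs without a GPU; a real reference would build tensors and call PyTorch ops.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_reference(path):
    """Import the module at `path` and return its `reference` callable."""
    spec = importlib.util.spec_from_file_location("reference_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    fn = getattr(module, "reference", None)
    if not callable(fn):
        raise AttributeError(f"{path} must define reference(**kwargs)")
    return fn


# Stand-in reference module (hypothetical); it only multiplies two dimensions.
SOURCE = "def reference(**kwargs):\n    return kwargs['M'] * kwargs['N']\n"

with tempfile.TemporaryDirectory() as tmp:
    ref_path = Path(tmp) / "reference.py"
    ref_path.write_text(SOURCE)
    reference = load_reference(ref_path)
    # Dimension/scalar arguments are passed through as keyword arguments.
    result = reference(M=4, N=8)
```

Failing early with a clear error when reference is missing keeps a misconfigured run from reaching the timing stage.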
Dependency Matrix
Required Modules
- torch
- triton
- nvidia-cutlass-dsl
- flashinfer-python
Components
scripts
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: kernel-benchmark
Download link: https://github.com/fmh66/kernel-opt-agent/archive/main.zip#kernel-benchmark
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.