kernel-benchmark

Community

Benchmark GPU kernels vs PyTorch baselines.

Author: fmh66
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

You need an objective way to compare a custom GPU kernel implementation (CUDA-C++, CUTLASS, CuTe DSL, or Triton) against known PyTorch baselines, while ensuring correctness before you trust any timing results.

Core Features & Use Cases

  • Correctness-first benchmarking: Validates solution outputs against a required PyTorch eager reference before collecting latency numbers.
  • Multiple kernel & baseline types: Benchmarks CUDA-C++/CUTLASS shared-library kernels and CuTe DSL/Triton Python implementations, with optional baselines for torch.compile and FlashInfer.
  • KernelBench-style timing: Uses CUDA event timing by default with warmup, trial control, and L2 cache thrashing to reduce measurement bias.
  • Actionable report output: Writes a benchmark.md summary including timing distribution stats and speedups versus selected baselines.
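The reporting step in the last bullet can be sketched in plain Python. This is only an illustration of how per-trial latencies might be reduced to distribution stats and speedups. The function names (summarize, speedup, render_report), the table columns, and the sample latencies are all hypothetical; the actual layout of benchmark.md is not documented here, and the numbers below are made up rather than measured.

```python
import statistics


def summarize(latencies_ms):
    """Reduce per-trial latencies (milliseconds) to basic distribution stats."""
    ordered = sorted(latencies_ms)
    return {
        "trials": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        # stdev needs at least two samples; report 0.0 for a single trial.
        "std": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min": ordered[0],
        "max": ordered[-1],
    }


def speedup(baseline_ms, solution_ms):
    """Ratio of baseline to solution latency; above 1.0 means the solution is faster."""
    return baseline_ms / solution_ms


def render_report(solution_ms, baselines_ms):
    """Build a small text table (hypothetical layout) comparing median latencies."""
    sol = summarize(solution_ms)
    rows = [
        "| Implementation | Median (ms) | Mean (ms) | Std (ms) | Solution speedup |",
        "| --- | --- | --- | --- | --- |",
        f"| solution | {sol['median']:.3f} | {sol['mean']:.3f} | {sol['std']:.3f} | 1.00x |",
    ]
    for name, samples in baselines_ms.items():
        base = summarize(samples)
        ratio = speedup(base["median"], sol["median"])
        rows.append(
            f"| {name} | {base['median']:.3f} | {base['mean']:.3f} "
            f"| {base['std']:.3f} | {ratio:.2f}x |"
        )
    return "\n".join(rows)


if __name__ == "__main__":
    # Illustrative latencies only, not real measurements.
    solution = [0.50, 0.52, 0.48]
    baselines = {"torch eager": [1.00, 1.04, 0.96]}
    print(render_report(solution, baselines))
```

Comparing medians rather than means is a design choice in this sketch: the median is less sensitive to the occasional slow trial caused by clock ramp-up or background activity.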

Quick Start

Run the benchmark by providing your solution file, a reference module that defines reference(**kwargs), a directory to write benchmark.md, and the kernel dimension/scalar arguments your reference expects.

Dependency Matrix

Required Modules

  • torch
  • triton
  • nvidia-cutlass-dsl
  • flashinfer-python

Components

scripts

💻 Claude Code Installation

Recommended: Let Claude install automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: kernel-benchmark
Download link: https://github.com/fmh66/kernel-opt-agent/archive/main.zip#kernel-benchmark

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
