kernel-benchmark

Name: kernel-benchmark
Author: fmh66

Community

Benchmark GPU kernels vs PyTorch baselines.

Software Engineering #performance tuning #gpu #pytorch #triton #cuda events #kernel benchmarking #correctness testing

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

You need an objective way to compare a custom GPU kernel implementation (CUDA-C++, CUTLASS, CuTe DSL, or Triton) against known PyTorch baselines, while ensuring correctness before you trust any timing results.

Core Features & Use Cases

Correctness-first benchmarking: Validates solution outputs against a required PyTorch eager reference before collecting latency numbers.
Multiple kernel & baseline types: Benchmarks CUDA-C++/CUTLASS shared-library kernels and CuTe DSL/Triton Python implementations, with optional baselines for torch.compile and FlashInfer.
KernelBench-style timing: Uses CUDA event timing by default with warmup, trial control, and L2 cache thrashing to reduce measurement bias.
Actionable report output: Writes a benchmark.md summary including timing distribution stats and speedups versus selected baselines.
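The report step can be illustrated with a minimal sketch. It assumes latencies arrive as plain lists of milliseconds and that speedup is computed from mean latency; the function names (`summarize`, `speedup`, `render_report`) and the table layout are hypothetical and are not taken from the skill's actual benchmark.md format.

```python
import statistics


def summarize(samples_ms):
    """Return basic distribution stats for a list of latencies in milliseconds."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        # Sample standard deviation needs at least two data points.
        "std": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
        "min": min(samples_ms),
        "max": max(samples_ms),
    }


def speedup(baseline_ms, solution_ms):
    """Ratio of mean latencies; a value above 1.0 means the solution is faster."""
    return statistics.mean(baseline_ms) / statistics.mean(solution_ms)


def render_report(solution_ms, baselines):
    """Render a Markdown table; `baselines` maps a label to its latency samples."""
    lines = [
        "| variant | mean (ms) | median (ms) | std (ms) | min (ms) | max (ms) | solution speedup |",
        "|---|---|---|---|---|---|---|",
    ]
    rows = [("solution", solution_ms, "-")]
    for label, samples in baselines.items():
        rows.append((label, samples, f"{speedup(samples, solution_ms):.2f}x"))
    for label, samples, ratio in rows:
        s = summarize(samples)
        lines.append(
            f"| {label} | {s['mean']:.3f} | {s['median']:.3f} | {s['std']:.3f} "
            f"| {s['min']:.3f} | {s['max']:.3f} | {ratio} |"
        )
    return "\n".join(lines)
```

Calling `render_report(solution_samples, {"torch eager": eager_samples})` returns a string that could be written to a benchmark.md file. Using the median instead of the mean for the speedup ratio is a reasonable alternative when the samples contain outliers.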

Quick Start

Run the benchmark by providing your solution file, a reference module that defines reference(**kwargs), a directory to write benchmark.md, and the kernel dimension/scalar arguments your reference expects.
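The correctness-first flow described here can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: plain Python lists stand in for tensors, `time.perf_counter` stands in for CUDA event timing, and the `solution`, `outputs_match`, and `benchmark` names plus the `n` and `alpha` arguments are hypothetical. Only the `reference(**kwargs)` entry point comes from the description above.

```python
import math
import time


def reference(**kwargs):
    """Hypothetical reference: a length-n vector of ones scaled by alpha."""
    n, alpha = kwargs["n"], kwargs["alpha"]
    return [alpha * 1.0 for _ in range(n)]


def solution(**kwargs):
    """Hypothetical candidate that produces the same values a different way."""
    n, alpha = kwargs["n"], kwargs["alpha"]
    return [alpha] * n


def outputs_match(expected, actual, rel_tol=1e-5, abs_tol=1e-8):
    """Element-wise comparison within a tolerance (stand-in for a tensor check)."""
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )


def benchmark(solution_fn, reference_fn, warmup=3, trials=10, **kwargs):
    """Validate against the reference first; collect timings only if it passes."""
    expected = reference_fn(**kwargs)
    actual = solution_fn(**kwargs)
    if not outputs_match(expected, actual):
        raise ValueError("solution output differs from reference; timing skipped")

    # Warmup runs are executed but not recorded.
    for _ in range(warmup):
        solution_fn(**kwargs)

    samples_ms = []
    for _ in range(trials):
        start = time.perf_counter()  # stand-in for a CUDA start event
        solution_fn(**kwargs)
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return samples_ms
```

A call such as `benchmark(solution, reference, n=1024, alpha=2.0)` forwards the dimension and scalar arguments to both functions. On a GPU, wall-clock timing would be replaced by CUDA events with a synchronize before reading elapsed time, since kernel launches are asynchronous.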
