benchmark-kernel

Community

Benchmark FlashInfer kernels with CUPTI timing.

Author: sunxxuns
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides a guided workflow to measure GPU kernel performance for FlashInfer using CUPTI timing, delivering accurate runtime metrics and comparisons across backends.

Core Features & Use Cases

  • CUPTI-based timing: Measure GPU kernel execution times precisely when benchmarking FlashInfer kernels.
  • Backend comparisons: Compare performance across backends such as fa2, fa2_tc, cudnn, trtllm, and cutlass.
  • Reproducible results: Generate CSV exports and standardized reports for later analysis.
  • Use cases: Performance profiling, hardware benchmarking, and optimization of kernel configurations.
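As a rough illustration of the backend-comparison step, the sketch below reads a CSV export and computes each backend's speedup relative to a chosen baseline. The column names `backend` and `median_time_ms`, the helper `speedups_vs_baseline`, and the sample rows are all assumptions made for this example; they are placeholders rather than real measurements, and the actual export may use different headers.

```python
import csv
import io

# Placeholder rows for illustration only: NOT real measurements.
# The column names are assumed; check the header of the actual CSV export.
SAMPLE_CSV = """backend,median_time_ms
fa2,2.0
fa2_tc,1.0
cudnn,4.0
"""


def speedups_vs_baseline(csv_text, baseline,
                         backend_col="backend", time_col="median_time_ms"):
    """Return {backend: baseline_time / backend_time} for each CSV row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    times = {row[backend_col]: float(row[time_col]) for row in rows}
    base = times[baseline]  # raises KeyError if the baseline is missing
    return {name: base / t for name, t in times.items()}


result = speedups_vs_baseline(SAMPLE_CSV, baseline="fa2")
# Print the fastest backend first (largest speedup over the baseline).
for name, speedup in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {speedup:.2f}x vs fa2")
```

A value above 1.0 means the backend ran faster than the baseline; a value below 1.0 means it ran slower.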

Quick Start

Install CUPTI-python and the CUDA drivers, then run one of the sample benchmarks, such as:

    python benchmarks/flashinfer_benchmark.py \
      --routine BatchDecodeWithPagedKVCacheWrapper \
      --backends fa2 fa2_tc cudnn \
      --page_size 16 \
      --batch_size 32 \
      --s_kv 2048 \
      --num_qo_heads 32 \
      --num_kv_heads 8 \
      --head_dim_qk 128 \
      --head_dim_vo 128 \
      --refcheck \
      -vv

Or run a smaller test:

    python benchmarks/flashinfer_benchmark.py \
      --routine bmm_fp8 \
      --backends cudnn cutlass \
      --batch_size 64 \
      --m 1 --n 1024 --k 2048 \
      --refcheck
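When sweeping several configurations, it can help to assemble these command lines programmatically. The following is a minimal sketch, not part of the Skill itself: `build_benchmark_cmd` is a hypothetical helper, and it uses only the script path and flags that appear in the Quick Start commands.

```python
import shlex


def build_benchmark_cmd(routine, backends,
                        script="benchmarks/flashinfer_benchmark.py",
                        refcheck=True, verbosity=0, **params):
    """Build an argv list for the benchmark script (hypothetical helper).

    Extra keyword arguments become `--name value` pairs in the order given.
    """
    cmd = ["python", script, "--routine", routine, "--backends", *backends]
    for name, value in params.items():
        cmd += [f"--{name}", str(value)]
    if refcheck:
        cmd.append("--refcheck")
    if verbosity:
        cmd.append("-" + "v" * verbosity)  # verbosity=2 yields "-vv"
    return cmd


# Reproduces the smaller bmm_fp8 test from the Quick Start.
cmd = build_benchmark_cmd(
    "bmm_fp8", ["cudnn", "cutlass"],
    batch_size=64, m=1, n=1024, k=2048,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine with a supported GPU and the FlashInfer repository checked out.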

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: Let Claude install automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: benchmark-kernel
Download link: https://github.com/sunxxuns/flashinfer/archive/main.zip#benchmark-kernel

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.