benchmark-kernel
Benchmark FlashInfer kernels with CUPTI timing.
System Documentation
What problem does it solve?
This Skill provides a guided workflow to measure GPU kernel performance for FlashInfer using CUPTI timing, delivering accurate runtime metrics and comparisons across backends.
Core Features & Use Cases
- CUPTI-based timing: Measure GPU kernel execution times precisely when benchmarking FlashInfer kernels.
- Backend comparisons: Compare performance across fa2, fa2_tc, cudnn, and trtllm backends.
- Reproducible results: Generate CSV exports and standardized reports for later analysis.
- Use cases: Performance profiling, hardware benchmarking, and optimization of kernel configurations.
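As a rough illustration of the backend comparison described above, the sketch below reads a CSV export and reports each backend's speedup over a chosen baseline. The column names `backend` and `median_time` are assumptions made for this example, not the benchmark's documented CSV schema; adjust them to match the header of the file you actually get.

```python
import csv
from collections import defaultdict
from statistics import median


def median_time_by_backend(csv_file, backend_col="backend", time_col="median_time"):
    """Group rows by backend and return the median of the time column per backend.

    `csv_file` is any open text file object. The default column names are
    hypothetical; pass the real ones from your CSV header.
    """
    times = defaultdict(list)
    for row in csv.DictReader(csv_file):
        times[row[backend_col]].append(float(row[time_col]))
    return {backend: median(values) for backend, values in times.items()}


def speedup_vs(baseline, medians):
    """Return baseline_time / backend_time per backend (values above 1 are faster)."""
    base = medians[baseline]
    return {backend: base / t for backend, t in medians.items()}
```

A typical call would open the exported file with `open(path, newline="")`, pass it to `median_time_by_backend`, and then call `speedup_vs("fa2", medians)` to use fa2 as the baseline.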
Quick Start
Install cupti-python and the CUDA drivers, then run one of the sample benchmarks, for example:
python benchmarks/flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 fa2_tc cudnn \
    --page_size 16 \
    --batch_size 32 \
    --s_kv 2048 \
    --num_qo_heads 32 \
    --num_kv_heads 8 \
    --head_dim_qk 128 \
    --head_dim_vo 128 \
    --refcheck \
    -vv
Or run a smaller test:
python benchmarks/flashinfer_benchmark.py \
    --routine bmm_fp8 \
    --backends cudnn cutlass \
    --batch_size 64 \
    --m 1 --n 1024 --k 2048 \
    --refcheck
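To repeat the decode benchmark across several batch sizes, the command can be assembled programmatically. The sketch below is a minimal example that uses only the flags shown in the Quick Start commands above; the helper name `build_decode_cmd` and the chosen batch sizes are illustrative, not part of the Skill.

```python
import shlex

SCRIPT = "benchmarks/flashinfer_benchmark.py"


def build_decode_cmd(batch_size, backends=("fa2", "fa2_tc", "cudnn"), s_kv=2048):
    """Return the argv list for one paged-KV decode benchmark run.

    Hypothetical helper: it only mirrors the flags from the Quick Start example.
    """
    return [
        "python", SCRIPT,
        "--routine", "BatchDecodeWithPagedKVCacheWrapper",
        "--backends", *backends,
        "--page_size", "16",
        "--batch_size", str(batch_size),
        "--s_kv", str(s_kv),
        "--num_qo_heads", "32",
        "--num_kv_heads", "8",
        "--head_dim_qk", "128",
        "--head_dim_vo", "128",
        "--refcheck",
    ]


# Print one shell-quoted command per batch size (nothing is executed here).
for bs in (1, 8, 32):
    print(shlex.join(build_decode_cmd(bs)))
```

To actually launch a run, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root so that the relative script path resolves.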
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: benchmark-kernel
Download link: https://github.com/sunxxuns/flashinfer/archive/main.zip#benchmark-kernel
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.