benchmark-kernel

Name: benchmark-kernel
Availability: InStock
Author: sunxxuns

Community

Benchmark FlashInfer kernels with CUPTI timing.

Software Engineering #benchmarking #GPU #CUDA #performance-metrics #FlashInfer #CUPTI #kernel-timing

Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

This Skill provides a guided workflow to measure GPU kernel performance for FlashInfer using CUPTI timing, delivering accurate runtime metrics and comparisons across backends.

Core Features & Use Cases

CUPTI-based timing: Achieve precise GPU kernel execution times for benchmarking FlashInfer kernels.
Backend comparisons: Compare performance across fa2, fa2_tc, cudnn, and trtllm backends.
Reproducible results: Generate CSV exports and standardized reports for later analysis.
Use cases: Performance profiling, hardware benchmarking, and optimization of kernel configurations.
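Comparing backends from an exported CSV can be sketched as follows. This is a minimal illustration, not the Skill's own tooling: the column names (routine, backend, median_time_ms) and the timing values are hypothetical placeholders, not real measurements, so adjust them to match the header of the file the benchmark actually writes.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout and placeholder numbers (NOT real measurements).
# The real export's column names may differ.
SAMPLE_CSV = """routine,backend,median_time_ms
BatchDecodeWithPagedKVCacheWrapper,fa2,0.120
BatchDecodeWithPagedKVCacheWrapper,fa2_tc,0.080
BatchDecodeWithPagedKVCacheWrapper,cudnn,0.100
"""


def speedups(csv_text, baseline="fa2"):
    """Return {routine: {backend: baseline_time / backend_time}}."""
    times = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        times[row["routine"]][row["backend"]] = float(row["median_time_ms"])

    result = {}
    for routine, per_backend in times.items():
        if baseline not in per_backend:
            continue  # no baseline measurement for this routine
        base = per_backend[baseline]
        result[routine] = {b: base / t for b, t in per_backend.items()}
    return result


if __name__ == "__main__":
    for routine, ratios in speedups(SAMPLE_CSV).items():
        for backend, ratio in sorted(ratios.items()):
            print(f"{routine} {backend}: {ratio:.2f}x vs fa2")
```

A ratio above 1.0 means the backend was faster than the chosen baseline for that routine.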

Quick Start

Install CUPTI-python and the CUDA drivers, then run one of the sample benchmarks, such as:

python benchmarks/flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 fa2_tc cudnn \
    --page_size 16 \
    --batch_size 32 \
    --s_kv 2048 \
    --num_qo_heads 32 \
    --num_kv_heads 8 \
    --head_dim_qk 128 \
    --head_dim_vo 128 \
    --refcheck \
    -vv
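To get a feel for the problem size this decode configuration implies, the paged KV-cache footprint can be estimated with a little arithmetic. The sketch below rests on assumptions that are not stated in the listing: every sequence holds exactly s_kv tokens, the cache uses a 2-byte element type such as fp16 or bf16, and the K and V head dimensions are equal. The helper name paged_kv_footprint is hypothetical.

```python
def paged_kv_footprint(batch_size, s_kv, page_size, num_kv_heads, head_dim,
                       bytes_per_elem=2):
    """Estimate pages and bytes for a paged KV cache.

    Assumes every sequence has exactly s_kv tokens and that K and V share
    the same head dimension; bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    pages_per_seq = -(-s_kv // page_size)  # ceiling division
    total_pages = batch_size * pages_per_seq
    # Each page stores both K and V for page_size tokens across all KV heads.
    bytes_per_page = 2 * page_size * num_kv_heads * head_dim * bytes_per_elem
    return pages_per_seq, total_pages, total_pages * bytes_per_page


if __name__ == "__main__":
    pages_per_seq, total_pages, total_bytes = paged_kv_footprint(
        batch_size=32, s_kv=2048, page_size=16, num_kv_heads=8, head_dim=128
    )
    print(f"{pages_per_seq} pages per sequence, {total_pages} pages in total")
    print(f"{total_bytes / 2**20:.0f} MiB of KV cache")
```

Under these assumptions the example works out to 128 pages per sequence, 4096 pages in total, and 256 MiB of KV cache.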

Or run a smaller test:

python benchmarks/flashinfer_benchmark.py \
    --routine bmm_fp8 \
    --backends cudnn cutlass \
    --batch_size 64 \
    --m 1 --n 1024 --k 2048 \
    --refcheck
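When sweeping several configurations, it can help to assemble these command lines programmatically rather than by hand. The sketch below only builds and prints the command; it uses the script path and flags shown above, while the helper name build_command is hypothetical and not part of FlashInfer.

```python
import shlex


def build_command(routine, backends,
                  script="benchmarks/flashinfer_benchmark.py",
                  flags=(), **options):
    """Build an argument list for the benchmark script.

    `options` become `--name value` pairs in the order given; `flags` are
    appended verbatim (for example "--refcheck" or "-vv").
    """
    cmd = ["python", script, "--routine", routine, "--backends", *backends]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    cmd += list(flags)
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "bmm_fp8", ["cudnn", "cutlass"],
        flags=["--refcheck"],
        batch_size=64, m=1, n=1024, k=2048,
    )
    print(shlex.join(cmd))
```

To actually launch the benchmark, the returned list could be passed to subprocess.run(cmd, check=True) from the repository root, assuming the dependencies and a CUDA-capable GPU are available.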
