vllm-performance-tuning
Official
Fine-tune vLLM for peak inference.
Author: air-gapped
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Tuning vLLM for large-scale LLM deployments is often done ad hoc, one knob at a time. This skill provides a structured, repeatable optimization workflow for MoE models and different hardware configurations, shortening the path from a first deployment to a tuned one.
Core Features & Use Cases
- MoE kernel autotuning with benchmark_moe.py to generate tuned configurations and compare performance across token shapes.
- Guidance for tensor-parallel, data-parallel, expert-parallel, and pipeline-parallel setups, plus disaggregation patterns (NIXL/Mooncake/LMCache) for scaling.
- End-to-end workflow from baseline to re-bench, including CUDA graphs, compile cache, and scheduler knob tuning to hit SLOs.
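The baseline-to-re-bench loop in the last bullet comes down to comparing latency percentiles before and after a change and checking them against the SLO. The sketch below is one possible way to do that comparison in plain Python. The `BenchResult` class, the helper names, and the choice of p99 as the SLO metric are illustrative assumptions, not part of the skill or of vLLM.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class BenchResult:
    """Per-request latencies from one benchmark run (hypothetical container)."""
    label: str
    latencies_ms: list[float]

    def p99(self) -> float:
        # quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
        return quantiles(self.latencies_ms, n=100, method="inclusive")[-1]


def meets_slo(result: BenchResult, slo_p99_ms: float) -> bool:
    """True when the run's p99 latency is at or below the SLO target."""
    return result.p99() <= slo_p99_ms


def p99_speedup(baseline: BenchResult, tuned: BenchResult) -> float:
    """Ratio of baseline p99 to tuned p99; values above 1.0 mean the tuned run is faster."""
    return baseline.p99() / tuned.p99()


def report(baseline: BenchResult, tuned: BenchResult, slo_p99_ms: float) -> str:
    """One-line summary to keep alongside each tuning step."""
    status = "meets" if meets_slo(tuned, slo_p99_ms) else "misses"
    return (
        f"{tuned.label}: p99 {tuned.p99():.1f} ms vs "
        f"{baseline.label} {baseline.p99():.1f} ms "
        f"({p99_speedup(baseline, tuned):.2f}x), {status} SLO of {slo_p99_ms:.1f} ms"
    )
```

Keeping one such summary per change (CUDA graphs, compile cache, a scheduler knob) makes it easier to attribute a gain or regression to a single step rather than to the whole batch of changes.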
Quick Start
Tune MoE kernels on the target GPU and load the generated configs before serving.
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: vllm-performance-tuning
Download link: https://github.com/air-gapped/skills/archive/main.zip#vllm-performance-tuning
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.