vllm-performance-tuning

Official

Tune vLLM for peak inference performance.

Author: air-gapped
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Tuning vLLM for Mixture-of-Experts (MoE) models and specific hardware configurations is often done ad hoc. This skill shortens the path to deploying large-scale LLMs by providing a structured, repeatable optimization workflow.

Core Features & Use Cases

  • MoE kernel autotuning with benchmark_moe.py to generate tuned configurations and compare performance across token shapes.
  • Comprehensive guidance for tensor-parallel, data-parallel, expert-parallel, and pipeline-parallel setups, plus disaggregation patterns (NIXL/Mooncake/LMCache) for scaling.
  • End-to-end workflow from baseline to re-bench, including CUDA graphs, compile cache, and scheduler knob tuning to hit SLOs.
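
The baseline-to-re-bench step in the last bullet can be sketched as a small comparison helper. This is a minimal illustration, not part of the skill: the function names, the nearest-rank percentile method, and the millisecond latency lists are all assumptions chosen for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare_runs(baseline_ms, tuned_ms, slo_ms, pct=99):
    """Compare tail latency before and after tuning against an SLO target.

    baseline_ms / tuned_ms: per-request latencies (hypothetical inputs,
    e.g. collected from a benchmark run before and after a config change).
    """
    base = percentile(baseline_ms, pct)
    tuned = percentile(tuned_ms, pct)
    return {
        "baseline": base,
        "tuned": tuned,
        "speedup": base / tuned,
        "meets_slo": tuned <= slo_ms,
    }


if __name__ == "__main__":
    result = compare_runs([100, 200], [50, 80], slo_ms=90)
    print(result)
```

Comparing the same percentile on both runs keeps the before/after numbers directly comparable; change one knob at a time so a regression can be attributed to a single setting.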

Quick Start

Tune MoE kernels on the target GPU and load the generated configs before serving.
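
As a rough sketch of what that looks like in practice, the helper below only assembles a tuning command line and prints it; nothing is executed. The script path, the `--model`, `--tp-size`, `--dtype`, and `--tune` flags, and the `VLLM_TUNED_CONFIG_FOLDER` environment variable reflect how vLLM's benchmark script is commonly invoked, but treat all of them as assumptions and confirm against `--help` for the vLLM version in use. The model name and the config directory are placeholders.

```python
import shlex


def build_moe_tune_command(model, tp_size, dtype="auto",
                           script="benchmarks/kernels/benchmark_moe.py"):
    """Return an argv list for a MoE kernel tuning run (not executed here).

    Flag names are assumptions; verify them with `python <script> --help`.
    """
    return [
        "python", script,
        "--model", model,
        "--tp-size", str(tp_size),
        "--dtype", dtype,
        "--tune",
    ]


# Placeholder model and tensor-parallel size; run this on the target GPU type.
cmd = build_moe_tune_command("mistralai/Mixtral-8x7B-v0.1", tp_size=2)
print(shlex.join(cmd))

# Assumption: pointing vLLM at a folder of tuned JSON configs before serving.
# The variable name and directory are illustrative; check your vLLM version.
serve_env = {"VLLM_TUNED_CONFIG_FOLDER": "/path/to/tuned_configs"}
print(serve_env)
```

Tuned configs are specific to the GPU model, so a configuration generated on one device type should not be assumed to carry over to another.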

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: vllm-performance-tuning
Download link: https://github.com/air-gapped/skills/archive/main.zip#vllm-performance-tuning

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.