llm-serving-performance-tuning

Community

Tune vLLM and the gateway to meet p99 latency targets safely.

Author: saintgo7
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses LLM serving bottlenecks by having you measure the real constraint across vLLM, the FastAPI gateway, and the database, rather than guessing from GPU utilization alone.

Core Features & Use Cases

  • 6-step, evidence-driven tuning workflow: baseline load testing, bottleneck classification, vLLM parameter tuning, gateway tuning, DB tuning, then re-measuring to confirm impact.
  • Production-focused diagnostics: maps symptoms like queue backlog, 429/5xx distributions, sustained GPU underutilization, and DB pool exhaustion to the correct tuning step.
  • Concurrency validation for GEM-LLM-style workloads: guides performance decisions using multi-user scenarios (e.g., 50/100/200 concurrent users) with tok/s, RPS, and p99 latency targets.
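The symptom-to-step mapping described in the list above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the skill's actual logic: the `Metrics` fields, the thresholds (5% 429 rate, 60% GPU utilization), the order of the checks, and the returned step names are all hypothetical and would need to be replaced with values from your own baseline load test.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Hypothetical snapshot of one load-test run (all field names are assumptions)."""
    latencies_ms: list[float]  # per-request end-to-end latency
    gpu_util: float            # mean GPU utilization, 0.0 to 1.0
    queue_depth: int           # requests waiting for the inference engine
    rate_429: float            # fraction of responses that were HTTP 429
    db_pool_in_use: int        # DB connections checked out at peak
    db_pool_size: int          # configured DB pool size


def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    # Integer form of ceil(0.99 * n), avoiding float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def classify(m: Metrics, p99_target_ms: float) -> str:
    """Map observed symptoms to the tuning step to try first (assumed ordering)."""
    if p99(m.latencies_ms) <= p99_target_ms:
        return "ok"
    if m.db_pool_in_use >= m.db_pool_size:
        return "db"        # pool exhaustion: look at DB tuning
    if m.rate_429 > 0.05:  # assumed threshold
        return "gateway"   # many rejections: look at gateway limits
    if m.queue_depth > 0 and m.gpu_util < 0.6:  # assumed threshold
        return "vllm"      # backlog while the GPU is underused
    return "unclear"       # re-measure with finer-grained metrics


run = Metrics(
    latencies_ms=[100.0] * 90 + [5000.0] * 10,
    gpu_util=0.35, queue_depth=40, rate_429=0.0,
    db_pool_in_use=3, db_pool_size=20,
)
print(classify(run, p99_target_ms=2000.0))
```

Checking the DB pool and gateway before the GPU reflects the point made earlier in this document: low GPU utilization on its own does not show that the inference engine is the constraint.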

Quick Start

Run the skill's install entrypoint to apply the 6-step tuning workflow for your vLLM + FastAPI + DB deployment:

  ./install.sh llm-serving-performance-tuning

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: llm-serving-performance-tuning
Download link: https://github.com/saintgo7/claude-skills/archive/main.zip#llm-serving-performance-tuning

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.