vllm-tgi-inference
CommunityHigh-throughput local LLM serving with vLLM/TGI
System Documentation
What problem does it solve?
Deploying and operating local LLM inference servers is complex and error-prone due to GPU memory limits, tensor-parallel sharding, quantization formats, batching and KV cache sizing, and differing engine flags. This Skill consolidates server launch patterns, engine decision guidance, quantized model serving, OpenAI-compatible API usage, and a diagnostic checklist to reduce downtime and OOM failures.
Core Features & Use Cases
- Engine Selection: Guidance to choose between vLLM (pip, strong TP and quantization support) and TGI (Docker-native, grammar & watermarking).
- Server Launch & Sharding: Commands and validation checks for single-GPU, multi-GPU tensor parallelism, and Docker-based TGI sharding.
- Quantized Model Serving: Instructions for serving AWQ, GPTQ, GGUF and bitsandbytes formats and their engine-specific flags.
- Performance Tuning & Diagnostics: VRAM estimation, KV cache tuning, batch scheduling, metrics, and OOM troubleshooting steps.
- Client Patterns: OpenAI-compatible Python and curl examples for chat, completion, and streaming clients.
- Use Case: Launch a multi-GPU vLLM server to serve a quantized 70B model with tensor parallelism and monitor Prometheus metrics to tune throughput.
Quick Start
Start a local vLLM server to serve meta-llama/Llama-3.1-8B-Instruct on port 8000 with an OpenAI-compatible API and validate the /v1/models endpoint.
Dependency Matrix
Required Modules
None requiredComponents
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: vllm-tgi-inference Download link: https://github.com/jayll1303/AIEKit/archive/main.zip#vllm-tgi-inference Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
Agent Skills Search Helper
Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.