vllm-tgi-inference

Community

High-throughput local LLM serving with vLLM/TGI

Authorjayll1303
Version1.0.0
Installs0

System Documentation

What problem does it solve?

Deploying and operating local LLM inference servers is complex and error-prone due to GPU memory limits, tensor-parallel sharding, quantization formats, batching and KV cache sizing, and differing engine flags. This Skill consolidates server launch patterns, engine decision guidance, quantized model serving, OpenAI-compatible API usage, and a diagnostic checklist to reduce downtime and OOM failures.

Core Features & Use Cases

  • Engine Selection: Guidance to choose between vLLM (pip, strong TP and quantization support) and TGI (Docker-native, grammar & watermarking).
  • Server Launch & Sharding: Commands and validation checks for single-GPU, multi-GPU tensor parallelism, and Docker-based TGI sharding.
  • Quantized Model Serving: Instructions for serving AWQ, GPTQ, GGUF and bitsandbytes formats and their engine-specific flags.
  • Performance Tuning & Diagnostics: VRAM estimation, KV cache tuning, batch scheduling, metrics, and OOM troubleshooting steps.
  • Client Patterns: OpenAI-compatible Python and curl examples for chat, completion, and streaming clients.
  • Use Case: Launch a multi-GPU vLLM server to serve a quantized 70B model with tensor parallelism and monitor Prometheus metrics to tune throughput.

Quick Start

Start a local vLLM server to serve meta-llama/Llama-3.1-8B-Instruct on port 8000 with an OpenAI-compatible API and validate the /v1/models endpoint.

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.

Please help me install this Skill:
Name: vllm-tgi-inference
Download link: https://github.com/jayll1303/AIEKit/archive/main.zip#vllm-tgi-inference

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
View Source Repository

Agent Skills Search Helper

Install a tiny helper to your Agent, search and equip skill from 471,000+ vetted skills library on demand.