vllm-tgi-inference

Name: vllm-tgi-inference
Author: jayll1303

Community

High-throughput local LLM serving with vLLM/TGI

Software Engineering #quantization #vllm #kv-cache #llm-serving #tensor-parallelism #tgi


Version: 1.0.0

Installs: 0

System Documentation

What problem does it solve?

Deploying and operating local LLM inference servers is complex and error-prone due to GPU memory limits, tensor-parallel sharding, quantization formats, batching and KV cache sizing, and differing engine flags. This Skill consolidates server launch patterns, engine decision guidance, quantized model serving, OpenAI-compatible API usage, and a diagnostic checklist to reduce downtime and OOM failures.

Core Features & Use Cases

Engine Selection: Guidance for choosing between vLLM (pip-installable, with strong tensor-parallelism and quantization support) and TGI (Docker-native, with grammar-constrained generation and watermarking).
Server Launch & Sharding: Commands and validation checks for single-GPU, multi-GPU tensor parallelism, and Docker-based TGI sharding.
Quantized Model Serving: Instructions for serving AWQ, GPTQ, GGUF and bitsandbytes formats and their engine-specific flags.
Performance Tuning & Diagnostics: VRAM estimation, KV cache tuning, batch scheduling, metrics, and OOM troubleshooting steps.
Client Patterns: OpenAI-compatible Python and curl examples for chat, completion, and streaming clients.
Use Case: Launch a multi-GPU vLLM server to serve a quantized 70B model with tensor parallelism and monitor Prometheus metrics to tune throughput.
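The VRAM estimation and KV cache sizing mentioned above can be approximated with simple arithmetic. The sketch below is illustrative only and is not taken from the Skill itself: the `ModelShape` class and the helper names are hypothetical, the Llama-3.1-8B shape values are assumed from the published model configuration, and the estimate ignores activation memory, CUDA graphs, and allocator overhead, so real capacity will be lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the numbers the estimate needs."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    num_params: float


def kv_bytes_per_token(shape: ModelShape, dtype_bytes: int = 2) -> int:
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * dtype_bytes


def weight_bytes(shape: ModelShape, bytes_per_param: float = 2.0) -> float:
    # 2.0 bytes per parameter for FP16/BF16; roughly 0.5 for 4-bit quantization.
    return shape.num_params * bytes_per_param


def max_cached_tokens(
    shape: ModelShape,
    vram_bytes_per_gpu: int,
    gpu_memory_utilization: float = 0.9,
    tensor_parallel_size: int = 1,
    bytes_per_param: float = 2.0,
) -> int:
    # Rough assumption: weights and KV cache are split evenly across GPUs,
    # so the total budget is the per-GPU budget times the number of GPUs.
    budget = vram_bytes_per_gpu * gpu_memory_utilization * tensor_parallel_size
    free = budget - weight_bytes(shape, bytes_per_param)
    if free <= 0:
        return 0  # the weights alone do not fit: expect an OOM at load time
    return int(free // kv_bytes_per_token(shape))


# Assumed shape for Llama-3.1-8B: 32 layers, 8 KV heads, head dimension 128.
LLAMA_8B = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128, num_params=8.0e9)

if __name__ == "__main__":
    print("KV bytes per token:", kv_bytes_per_token(LLAMA_8B))
    print("Tokens on one 24 GiB GPU:", max_cached_tokens(LLAMA_8B, 24 * 2**30))
```

A result of 0 means the weights exceed the memory budget before any KV cache is allocated, which points toward quantization or a larger tensor-parallel size rather than KV cache tuning.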

Quick Start

Start a local vLLM server to serve meta-llama/Llama-3.1-8B-Instruct on port 8000 with an OpenAI-compatible API and validate the /v1/models endpoint.
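As a minimal sketch of this quick start, the Python below builds the `vllm serve` command line and checks the `/v1/models` endpoint once the server is running. The helper names `build_serve_command` and `list_models` are hypothetical and not part of the Skill; the base URL assumes the server is listening on localhost, and the response is assumed to follow the OpenAI-style list shape with a `data` array of objects carrying an `id`.

```python
import json
import urllib.request


def build_serve_command(model, port=8000, tensor_parallel_size=1, quantization=None):
    """Return the argv list for launching a vLLM OpenAI-compatible server."""
    cmd = ["vllm", "serve", model, "--port", str(port)]
    if tensor_parallel_size > 1:
        # Shard the model across several GPUs.
        cmd += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization:
        # For example "awq" or "gptq"; must match the checkpoint format.
        cmd += ["--quantization", quantization]
    return cmd


def list_models(base_url="http://localhost:8000", timeout=5.0):
    """Return the model ids reported by /v1/models; raises if unreachable."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    # Print the command to run in a separate terminal.
    print(" ".join(build_serve_command("meta-llama/Llama-3.1-8B-Instruct")))
```

Once the server has finished loading, calling `list_models()` should return a list containing the served model name; a connection error usually means the server is still starting or is bound to a different port.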
