vllm-quantization
Official
Quantize LLMs for efficient vLLM serving.
Author: air-gapped
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This skill covers quantization workflows for vLLM deployments across Hopper, Blackwell, and ROCm fleets. It reduces memory footprint and compute cost using validated formats such as FP8, NVFP4, and MXFP4, and integrates with llm-compressor and NVIDIA ModelOpt.
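To make the memory reduction concrete, here is a rough back-of-the-envelope sketch of weight storage for a 70B-parameter model. It assumes every weight is quantized (in practice layers such as `lm_head` often stay in higher precision) and ignores activations and KV cache. The effective bit widths assume one 8-bit scale per 16-element block for NVFP4 and one 8-bit scale per 32-element block for MXFP4; the helper name `weight_gb` is illustrative and not part of the skill.

```python
# Rough weight-memory estimate; a sketch under the assumptions stated above.
PARAMS = 70_000_000_000  # 70B-class model

# Effective bits per weight, including per-block scale overhead (assumed):
#   NVFP4: 4 bits + 8-bit scale / 16 elements = 4.5 bits
#   MXFP4: 4 bits + 8-bit scale / 32 elements = 4.25 bits
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16,
    "MXFP4": 4.0 + 8.0 / 32,
}


def weight_gb(params: int, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>6}: {weight_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions the estimate is about 140 GB for BF16, 70 GB for FP8, and roughly 37 to 39 GB for the 4-bit formats.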
Core Features & Use Cases
- Supports multiple production quantization paths (FP8, NVFP4, MXFP4, and online quantization) and guidance for model export pipelines.
- Provides end-to-end coverage from post-training quantization (PTQ) to deployment, including KV cache options and MoE support across vendors.
- Use cases include preparing 70B-class models for datacenter inference, upgrading existing Qwen3/Qi models, and enabling offline model sharing with vendor formats.
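The KV cache options mentioned above matter because the cache grows linearly with context length. The sketch below estimates cache size per token for 16-bit versus 8-bit storage. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumed example of a 70B-class model with grouped-query attention, not a value taken from the skill, and `kv_bytes_per_token` is a hypothetical helper.

```python
# KV cache sizing sketch; the model shape below is an assumption for illustration.
LAYERS = 80      # assumed number of transformer layers
KV_HEADS = 8     # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128   # assumed per-head dimension


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


bf16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

context = 32_768  # tokens in a single sequence
GIB = 1024 ** 3
print(f"BF16: {bf16_per_token * context / GIB:.1f} GiB per 32k-token sequence")
print(f"FP8:  {fp8_per_token * context / GIB:.1f} GiB per 32k-token sequence")
```

With this assumed shape, an 8-bit cache halves the per-sequence footprint from 10 GiB to 5 GiB. In vLLM, an FP8 cache is typically requested through the `kv_cache_dtype` engine argument (for example `kv_cache_dtype="fp8"`); check the accuracy impact on your own workload before relying on it.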
Quick Start
Quantize a 70B model for Hopper GPUs using the FP8_DYNAMIC scheme with llm-compressor, then load the result into vLLM.
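A minimal sketch of that flow follows, assuming recent releases of `llmcompressor`, `transformers`, and `vllm` are installed. The model ID is a hypothetical placeholder, the helper names (`output_dir`, `quantize_fp8_dynamic`, `load_and_generate`) are illustrative, and `tensor_parallel_size=4` is only an example value. Import paths in llm-compressor have moved between releases, so adjust them to your installed version. FP8_DYNAMIC uses dynamic per-token activation scales, so no calibration dataset is passed here.

```python
from pathlib import Path

MODEL_ID = "your-org/your-70b-model"  # hypothetical placeholder, replace with a real model


def output_dir(model_id: str, suffix: str = "FP8-Dynamic") -> str:
    """Derive a local save directory name from the model ID."""
    return f"{Path(model_id).name}-{suffix}"


def quantize_fp8_dynamic(model_id: str) -> str:
    """Apply FP8_DYNAMIC to all Linear layers except lm_head and save the result."""
    # Heavy imports are kept local so the module can be loaded without GPUs.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    oneshot(model=model, recipe=recipe)

    save_dir = output_dir(model_id)
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    return save_dir


def load_and_generate(save_dir: str, tensor_parallel_size: int = 4) -> str:
    """Load the quantized checkpoint in vLLM and run one short generation."""
    from vllm import LLM, SamplingParams

    llm = LLM(model=save_dir, tensor_parallel_size=tensor_parallel_size)
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    return outputs[0].outputs[0].text


# Example usage (requires GPUs and the libraries above):
#   saved = quantize_fp8_dynamic(MODEL_ID)
#   print(load_and_generate(saved))
```

vLLM normally detects the quantization format from the saved checkpoint's config, so no explicit quantization flag is passed when loading.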
Dependency Matrix
Required Modules
None required
Components
references
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: vllm-quantization Download link: https://github.com/air-gapped/skills/archive/main.zip#vllm-quantization Please download this .zip file, extract it, and install it in the .claude/skills/ directory.