hqq-quantization

Official

Compress LLMs for faster inference.

Author: Orchestra-Research
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill addresses the memory and compute cost of large language models (LLMs) by quantizing their weights to low precision, so that models run faster and fit on limited hardware.

Core Features & Use Cases

  • Calibration-Free Quantization: Quantize models to 4/3/2-bit precision without a calibration dataset. Quantization parameters are derived from the weights alone, so no data collection or calibration pass is needed.
  • Optimized Backends: Supports several inference backends (PyTorch, ATen, TorchAO, Marlin, BitBLAS) so the quantized model can use kernels suited to the target hardware.
  • Framework Integration: Integrates with HuggingFace Transformers and vLLM for deployment and fine-tuning.
  • Use Case: Deploy an LLM such as Llama-3.1-8B on a consumer GPU by quantizing it to 4-bit with HQQ, reducing the memory footprint and improving response times for your application.
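
To make "calibration-free" and "reduced memory footprint" concrete, here is a minimal pure-Python sketch. It is not HQQ's actual algorithm: it uses plain min-max affine rounding as a simplified stand-in to show that the scale and zero point come from the weights themselves, with no dataset involved. The function names, the group size of 64, the 16-bit scale and zero point per group, and the round figure of 8 billion parameters are all illustrative assumptions.

```python
def quantize_group(weights, nbits=4):
    """Min-max affine quantization of one weight group.

    Simplified stand-in (NOT HQQ's solver): scale and zero point are
    computed from the weights alone, so no calibration data is needed.
    """
    qmax = (1 << nbits) - 1                      # largest integer code, e.g. 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # avoid division by zero for constant groups
    codes = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


def estimated_weight_bytes(n_params, nbits, group_size=64, meta_bits=16):
    """Rough size of quantized weights: packed codes plus one scale and one
    zero point per group (assumed layout; ignores unquantized layers)."""
    bits_per_weight = nbits + 2 * meta_bits / group_size
    return n_params * bits_per_weight / 8


group = [-0.30, -0.12, 0.00, 0.07, 0.21, 0.45]
codes, scale, zero = quantize_group(group, nbits=4)
restored = dequantize_group(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(group, restored))

n_params = 8_000_000_000                          # assumed round figure for an 8B model
fp16_gb = n_params * 2 / 1e9                      # 2 bytes per weight
int4_gb = estimated_weight_bytes(n_params, nbits=4) / 1e9
print(f"codes={codes} max_error={max_error:.4f}")
print(f"fp16 ~ {fp16_gb:.1f} GB, 4-bit ~ {int4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly 4.5 bits each once per-group metadata is counted, which is why an 8B-parameter model can fit within a consumer GPU's memory. Real footprints also depend on layers left unquantized, activations, and the KV cache.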

Quick Start

Use the hqq-quantization skill to quantize the 'meta-llama/Llama-3.1-8B' model to 4-bit precision.
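
For readers who want to see what such a request might translate to outside the agent, the following is a hedged sketch using the HQQ integration in HuggingFace Transformers (`HqqConfig`). It is not taken from this Skill's scripts. The helper name `load_hqq_model`, the group size of 64, the fp16 compute dtype, and the set of accepted bit widths are assumptions for illustration; the model is gated on the Hugging Face Hub, so access must be granted before it will download.

```python
SUPPORTED_NBITS = (8, 4, 3, 2, 1)  # assumed set of bit widths; this page lists 4/3/2


def load_hqq_model(model_id, nbits=4, group_size=64, device="cuda"):
    """Hypothetical helper: load a causal LM with HQQ weight quantization.

    Heavy dependencies are imported lazily so the function can be defined
    (and its argument check exercised) without torch/transformers/hqq installed.
    """
    if nbits not in SUPPORTED_NBITS:
        raise ValueError(f"nbits must be one of {SUPPORTED_NBITS}, got {nbits}")

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    # Quantization settings; weights are quantized on the fly while loading,
    # with no calibration dataset.
    quant_config = HqqConfig(nbits=nbits, group_size=group_size)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map=device,
        quantization_config=quant_config,
    )
    return model, tokenizer


# Example call (requires a GPU, the hqq package, and Hub access to the model):
# model, tokenizer = load_hqq_model("meta-llama/Llama-3.1-8B", nbits=4)
```

The example call is left commented out because it downloads the full checkpoint. Argument names can differ between library versions, so check the installed Transformers release before relying on this.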

Dependency Matrix

Required Modules

hqq, torch

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: hqq-quantization
Download link: https://github.com/Orchestra-Research/AI-Research-SKILLs/archive/main.zip#hqq-quantization

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
