kvcache-optimization

Community

Streamline KVCache for faster model inference.

Author: tuliang1024
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This skill optimizes the KVCache for LLM and multimodal inference. It reduces memory usage and improves throughput by selecting and combining cache strategies: continuous caching, paging attention with FA fusion operators, and MLA-based (Multi-head Latent Attention) compression. It is triggered by tasks involving KVCache management, paging attention, KV compression, FA fusion operators, OOM or performance issues, and block_table/slot_mapping construction, and it gives recommendations by model type and scenario.
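To make the memory argument concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the skill itself. It compares a standard per-head K/V cache with a cache that stores one compressed latent vector plus a small positional key per token, which is the general idea behind MLA-style compression. The function names and all model dimensions below are illustrative assumptions.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    """Bytes needed for a standard cache that stores K and V per head."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes


def latent_cache_bytes(num_layers, latent_dim, rope_dim, seq_len, batch_size, dtype_bytes=2):
    """Bytes needed when each token stores one shared latent plus a small RoPE key."""
    return num_layers * (latent_dim + rope_dim) * seq_len * batch_size * dtype_bytes


# Hypothetical model shape, chosen only for illustration (fp16, so 2 bytes per element).
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=1)
latent = latent_cache_bytes(num_layers=32, latent_dim=512, rope_dim=64, seq_len=4096, batch_size=1)

print(f"standard cache: {full / 1024**2:.0f} MiB")
print(f"latent cache:   {latent / 1024**2:.0f} MiB")
print(f"reduction:      {full / latent:.1f}x")
```

With these assumed dimensions the standard cache needs 2048 MiB per sequence and the latent cache 144 MiB. Real savings depend on the model's actual head count, latent width, and dtype.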

Core Features & Use Cases

  • Guidance on choosing KVCache modes for standard LLMs, high-performance LLMs, MLA architectures, and diffusion/video models.
  • Detailed implementation patterns: mode1 continuous caching, mode2 paging attention with block_table/slot_mapping, mode3 MLA cache compression, and how to compose them.
  • Real-world use case examples showing how to deploy KVCache strategies to reduce memory footprint while maintaining throughput.
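The block_table/slot_mapping construction mentioned for mode2 can be sketched as follows. This is a minimal illustration under assumed semantics common to paged KV caches, not the skill's own implementation: the block table maps each sequence's logical blocks to physical block ids, and the slot mapping gives each new token's flat write position (`physical_block * block_size + offset`). The class `BlockAllocator` and its methods are hypothetical names.

```python
from collections import deque


class BlockAllocator:
    """Toy allocator for a paged KV cache (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = deque(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens already cached

    def append_tokens(self, seq_id, num_new_tokens):
        """Reserve space for new tokens and return their slot mapping."""
        table = self.block_tables.setdefault(seq_id, [])
        start = self.seq_lens.get(seq_id, 0)
        total = start + num_new_tokens
        # Ceiling division: blocks required for `total` tokens, minus those already held.
        blocks_needed = -(-total // self.block_size) - len(table)
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("no free KV cache blocks")
        for _ in range(blocks_needed):
            table.append(self.free_blocks.popleft())
        self.seq_lens[seq_id] = total
        # Flat slot index = physical block id * block size + offset inside the block.
        return [
            table[pos // self.block_size] * self.block_size + pos % self.block_size
            for pos in range(start, total)
        ]

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=4)
slots_a = allocator.append_tokens("a", 5)  # prefill of 5 tokens -> [0, 1, 2, 3, 4]
slots_b = allocator.append_tokens("b", 3)  # second sequence     -> [8, 9, 10]
more_a = allocator.append_tokens("a", 4)   # "a" grows           -> [5, 6, 7, 12]
print(allocator.block_tables)
```

Sequence "a" ends up with the non-contiguous block table `[0, 1, 3]`. Tolerating such non-contiguous placement is the point of the paged layout: memory is claimed one block at a time instead of being reserved up front for the maximum sequence length.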

Quick Start

Select a mode (continuous cache, paging attention with FA, or MLA compression), then implement the corresponding KV cache storage, block_table mapping, and fused attention flow for your model.
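For the simplest of these modes, the storage side might look like the sketch below. It assumes that "continuous cache" means one preallocated, contiguous buffer per sequence that is filled token by token. The class `ContiguousKVCache` and the `token_width` parameter (heads times head dimension, flattened) are hypothetical and use only the Python standard library; a real implementation would hold device tensors instead.

```python
from array import array


class ContiguousKVCache:
    """Preallocated flat float32 buffers for one sequence (illustrative only)."""

    def __init__(self, max_seq_len, token_width):
        self.max_seq_len = max_seq_len
        self.token_width = token_width
        # Allocate once up front so decoding never reallocates or copies the cache.
        self.keys = array("f", [0.0]) * (max_seq_len * token_width)
        self.values = array("f", [0.0]) * (max_seq_len * token_width)
        self.length = 0  # number of tokens written so far

    def append(self, key, value):
        """Write one token's key and value vectors at the next free position."""
        if self.length == self.max_seq_len:
            raise MemoryError("cache is full")
        if len(key) != self.token_width or len(value) != self.token_width:
            raise ValueError("key/value width does not match token_width")
        start = self.length * self.token_width
        end = start + self.token_width
        self.keys[start:end] = array("f", key)
        self.values[start:end] = array("f", value)
        self.length += 1

    def filled(self):
        """Return copies of the written portion of the key and value buffers."""
        end = self.length * self.token_width
        return self.keys[:end], self.values[:end]


cache = ContiguousKVCache(max_seq_len=2, token_width=3)
cache.append([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
keys, values = cache.filled()
print(list(keys), list(values))
```

The trade-off is that each sequence reserves `max_seq_len` slots whether or not it uses them, which is the waste that a paged layout is meant to avoid.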

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy and paste the text below into Claude Code.

Please help me install this Skill:
Name: kvcache-optimization
Download link: https://github.com/tuliang1024/cann-recipes-infer/archive/main.zip#kvcache-optimization

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.