kvcache-optimization
Community
Streamline KVCache for faster model inference.
System Documentation
What problem does it solve?
This skill optimizes the KVCache for LLM and multimodal inference to reduce memory usage and improve throughput. It does so by selecting and combining cache strategies: continuous caching, paging attention with FA fusion operators, and MLA-based compression. It is triggered by questions about KVCache management, paging attention, KV compression, FA fusion operators, OOM or performance issues, and block_table/slot_mapping construction, and it offers recommendations by model type and scenario.
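To make the memory argument concrete, here is a rough back-of-the-envelope sketch comparing a standard KV cache with an MLA-style compressed cache. The function names and the model shape are hypothetical, chosen only for illustration. The MLA formula assumes the cache stores one compressed latent plus one RoPE key per layer and token; actual implementations may lay this out differently.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    """Standard cache: one K and one V vector per KV head, layer, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes


def mla_cache_bytes(num_layers, kv_lora_rank, rope_dim, seq_len, batch_size, dtype_bytes=2):
    """MLA-style cache (assumed layout): one latent plus one RoPE key per layer and token."""
    return num_layers * (kv_lora_rank + rope_dim) * seq_len * batch_size * dtype_bytes


# Hypothetical model shape with 16-bit cache entries (dtype_bytes=2).
standard = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                          seq_len=4096, batch_size=1)
compressed = mla_cache_bytes(num_layers=32, kv_lora_rank=512, rope_dim=64,
                             seq_len=4096, batch_size=1)

print(f"standard:   {standard / 2**30:.3f} GiB")    # standard:   2.000 GiB
print(f"compressed: {compressed / 2**30:.3f} GiB")  # compressed: 0.141 GiB
```

Under these assumed shapes the compressed cache is about 14 times smaller, which is the kind of headroom that can turn an OOM at long context into a workable batch size.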
Core Features & Use Cases
- Guidance on choosing KVCache modes for standard LLMs, high-performance LLMs, MLA architectures, and diffusion/video models.
- Detailed implementation patterns: mode1 continuous caching, mode2 paging attention with block_table/slot_mapping, mode3 MLA cache compression, and how to compose them.
- Real-world use case examples showing how to deploy KVCache strategies to reduce memory footprint while maintaining throughput.
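The block_table/slot_mapping pattern named for mode2 can be sketched as follows. This is a minimal pure-Python illustration, not the skill's own implementation: `BlockAllocator`, `extend_block_table`, and `build_slot_mapping` are hypothetical names. It assumes the common convention that a physical slot index equals `block_id * block_size + offset_in_block`.

```python
class BlockAllocator:
    """Hands out fixed-size physical cache blocks from a free list."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks")
        return self.free_blocks.pop(0)


def extend_block_table(block_table, num_tokens, block_size, allocator):
    """Grow a sequence's block_table until it can hold num_tokens tokens."""
    needed = -(-num_tokens // block_size)  # ceiling division
    while len(block_table) < needed:
        block_table.append(allocator.allocate())
    return block_table


def build_slot_mapping(block_table, positions, block_size):
    """Map each logical token position to a flat physical slot index."""
    return [block_table[p // block_size] * block_size + p % block_size
            for p in positions]


allocator = BlockAllocator(num_blocks=8)
seq_a = extend_block_table([], num_tokens=3, block_size=4, allocator=allocator)
seq_b = extend_block_table([], num_tokens=5, block_size=4, allocator=allocator)
seq_a = extend_block_table(seq_a, num_tokens=6, block_size=4, allocator=allocator)

print(seq_a)                                      # [0, 3]
print(seq_b)                                      # [1, 2]
print(build_slot_mapping(seq_a, range(6), 4))     # [0, 1, 2, 3, 12, 13]
```

Because sequence A's second block was allocated after sequence B's blocks, its tokens land in non-adjacent slots. The attention kernel reads them back through the block_table, so no sequence needs one large contiguous reservation.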
Quick Start
To apply the optimization, select a mode (continuous cache, paging attention with FA, or MLA compression), then implement the corresponding KV cache storage, block_table mapping, and fused attention flow for your model.
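As a possible starting point for the simplest of these modes, the continuous cache, the sketch below preallocates one contiguous K/V buffer per sequence and writes each new token at the next free position. `ContiguousKVCache` is a hypothetical name and plain Python lists stand in for device tensors; a real implementation would hold one such buffer per layer and head.

```python
class ContiguousKVCache:
    """Continuous-cache sketch: one preallocated K/V buffer per sequence."""

    def __init__(self, max_seq_len, head_dim):
        self.head_dim = head_dim
        self.keys = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.values = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.length = 0  # number of token positions filled so far

    def append(self, key, value):
        """Write the K/V vectors of the newest token at the next free position."""
        if self.length >= len(self.keys):
            raise MemoryError("contiguous KV cache is full")
        if len(key) != self.head_dim or len(value) != self.head_dim:
            raise ValueError("key/value must have head_dim elements")
        self.keys[self.length] = list(key)
        self.values[self.length] = list(value)
        self.length += 1

    def view(self):
        """Return the filled prefix that attention would read at this step."""
        return self.keys[: self.length], self.values[: self.length]


cache = ContiguousKVCache(max_seq_len=4, head_dim=2)
cache.append([1.0, 2.0], [3.0, 4.0])  # prefill token
cache.append([5.0, 6.0], [7.0, 8.0])  # first decode step
keys, values = cache.view()
print(len(keys), keys[-1], values[-1])  # 2 [5.0, 6.0] [7.0, 8.0]
```

The trade-off is visible in the constructor: memory for `max_seq_len` tokens is reserved up front whether or not the sequence ever grows that long, which is the waste that block-based paging is meant to avoid.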
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: kvcache-optimization
Download link: https://github.com/tuliang1024/cann-recipes-infer/archive/main.zip#kvcache-optimization
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.