llm-inference-batching-scheduler
Optimize LLM batching for cost and latency.
System Documentation
What problem does it solve?
This skill provides guidance for solving LLM inference batching and scheduling optimization problems, where requests must be grouped into batches while minimizing cost, padding waste, and latency.
Problem Understanding
Before implementation, thoroughly analyze the problem structure:
Constraint Analysis
- Identify all hard constraints - Extract exact limits for:
  - Maximum unique shapes allowed (e.g., ≤ 8 shapes across all buckets)
  - Latency thresholds (P95, P99)
  - Cost budget thresholds
  - Padding ratio limits
- Compute hard bounds early - Before coding, calculate:
  - Minimum possible padding from alignment requirements
  - Minimum number of batches required for coverage
  - Maximum achievable efficiency given constraints
- Decompose the cost function - Understand each component:
  - Per-batch overhead (fixed cost per batch)
  - Shape compilation costs (often quadratic in sequence length)
  - Prefill/decode costs (variable per request)
  - Document as:
    Cost ≈ overhead × num_batches + shape_compile_cost + prefill_cost + decode_cost
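The decomposition above can be sketched as a small function that reports each component separately. This is only an illustration: the `Batch` fields, the function name `cost_breakdown`, and every coefficient are hypothetical placeholders, and the quadratic compile term simply mirrors the "often quadratic" remark above rather than any particular task's cost model.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    seq_len: int         # padded prompt shape used by the batch (hypothetical field)
    prefill_tokens: int  # total padded prompt tokens in the batch
    decode_tokens: int   # total padded generation tokens in the batch


def cost_breakdown(batches, overhead=1.0, compile_coeff=1e-6,
                   prefill_rate=1e-3, decode_rate=2e-3):
    """Return each cost component separately so the dominant one is visible.

    All coefficients are illustrative placeholders, not values from a real task.
    """
    unique_shapes = {b.seq_len for b in batches}
    parts = {
        "overhead": overhead * len(batches),
        # Assumed quadratic in sequence length, paid once per unique shape.
        "shape_compile": sum(compile_coeff * s * s for s in unique_shapes),
        "prefill": prefill_rate * sum(b.prefill_tokens for b in batches),
        "decode": decode_rate * sum(b.decode_tokens for b in batches),
    }
    parts["total"] = sum(parts.values())
    return parts


batches = [Batch(512, 4096, 800), Batch(512, 2048, 400), Batch(1024, 8192, 1600)]
breakdown = cost_breakdown(batches)
for name, value in breakdown.items():
    print(f"{name:>14}: {value:.3f}")
```

Printing the components side by side makes it easier to see whether reducing the batch count or reducing padded tokens is likely to matter more.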
Data Analysis
- Profile the request distribution - Examine:
  - Distribution of prompt lengths (prompt_len)
  - Distribution of generation lengths (gen_len)
  - Identify outliers that may disproportionately impact metrics
- Verify coverage requirements - Ensure:
  - The largest prompt_len in each bucket is covered by chosen shapes
  - Edge cases with extreme gen_len values are handled
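A minimal profiling sketch for the checks above, using only the standard library. The function name `profile_lengths` and the outlier rule (more than 4× the median) are assumptions chosen for illustration; the sample lengths are made up.

```python
import statistics


def profile_lengths(values, outlier_factor=4.0):
    """Summarise a list of lengths and flag values far above the median."""
    ordered = sorted(values)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    median = statistics.median(ordered)
    return {
        "min": ordered[0],
        "median": median,
        "p95": cuts[94],
        "p99": cuts[98],
        "max": ordered[-1],
        # Heuristic threshold: the factor of 4 is an arbitrary assumption.
        "outliers": [v for v in ordered if v > outlier_factor * median],
    }


prompt_lens = [120, 130, 128, 250, 260, 255, 510, 500, 1900, 2048]
summary = profile_lengths(prompt_lens)
print(summary)
```

The `max` entry is the value any chosen shape list must cover; the same function can be run on gen_len values to spot extreme generation lengths.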
Implementation Approach
Build Reusable Evaluation Infrastructure
Before iterating on parameters, create a systematic evaluation harness:
1. Write a function that takes parameters (shape_list, gen_bucket_sizes) and returns all metrics
2. Include automatic constraint verification with assertions
3. Enable rapid parameter comparison without manual re-runs
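One possible shape for such a harness is sketched below. The batching model (smallest covering shape, gen_len rounded up to a multiple of the bucket size, batches capped at `max_batch`), the request layout, and the metric definitions are all assumptions made for illustration; a real task's rules should replace them.

```python
import bisect
from collections import defaultdict


def evaluate(requests, shape_list, gen_bucket_size, max_shapes=8, max_batch=4):
    """Toy harness: bucket requests, then return metrics for one parameter set."""
    shapes = sorted(set(shape_list))
    assert len(shapes) <= max_shapes, "too many unique shapes"
    groups = defaultdict(list)
    for r in requests:
        # Index of the smallest shape that is >= prompt_len.
        i = bisect.bisect_left(shapes, r["prompt_len"])
        assert i < len(shapes), f"no shape covers prompt_len={r['prompt_len']}"
        gen_pad = -(-r["gen_len"] // gen_bucket_size) * gen_bucket_size  # round up
        groups[(shapes[i], gen_pad)].append(r)

    real = padded = num_batches = 0
    for (shape, gen_pad), members in groups.items():
        num_batches += -(-len(members) // max_batch)  # ceiling division
        for r in members:
            real += r["prompt_len"] + r["gen_len"]
            padded += shape + gen_pad
    return {
        "num_batches": num_batches,
        "num_shapes": len({s for s, _ in groups}),
        "padding_ratio": (padded - real) / padded,
    }


requests = [
    {"prompt_len": 100, "gen_len": 20},
    {"prompt_len": 120, "gen_len": 25},
    {"prompt_len": 500, "gen_len": 40},
]
metrics = evaluate(requests, shape_list=[128, 512], gen_bucket_size=32)
print(metrics)
```

Because constraint violations raise immediately, an invalid parameter set cannot silently produce plausible-looking metrics.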
Parameter Search Strategy
Avoid random trial-and-error. Instead:
- Grid search for small parameter spaces - When parameters are bounded (e.g., gen_bucket_size in [15, 50]), systematically evaluate combinations
- Binary search for single parameters - When optimizing one parameter while holding others fixed, and the metric of interest changes monotonically with that parameter, use binary search to find the boundary value; without monotonicity, fall back to a sweep
- Document the optimization landscape - Track which parameter combinations produce which metric values to understand trade-offs
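A small grid-search sketch that also records the landscape. `toy_metrics` is a stand-in with invented formulas so the example runs on its own; in practice it would be replaced by a call to the real evaluation harness, and the parameter names are hypothetical.

```python
from itertools import product


def toy_metrics(gen_bucket_size, max_batch):
    """Stand-in for a real evaluation harness (purely hypothetical formulas)."""
    padding_ratio = gen_bucket_size / 400          # grows with bucket size
    cost = 100 / gen_bucket_size + 10 / max_batch  # shrinks with bucket size
    return {"padding_ratio": padding_ratio, "cost": cost}


def grid_search(bucket_sizes, batch_sizes, max_padding_ratio):
    """Evaluate every combination, log it, and keep the cheapest feasible one."""
    log, best = [], None
    for g, b in product(bucket_sizes, batch_sizes):
        m = toy_metrics(g, b)
        log.append(((g, b), m))  # record the landscape for later inspection
        if m["padding_ratio"] > max_padding_ratio:
            continue  # violates the hard constraint
        if best is None or m["cost"] < best[1]["cost"]:
            best = ((g, b), m)
    return best, log


best, log = grid_search(range(15, 51, 5), [4, 8, 16], max_padding_ratio=0.10)
print("best:", best)
print("combinations evaluated:", len(log))
```

Keeping the full `log` rather than only the winner is what makes the trade-offs visible afterwards.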
Shape Selection Guidelines
When selecting shapes for sequence length bucketing:
- Analyze the length distribution - Choose shapes that minimize padding for the most common lengths
- Consider power-of-two or geometric progressions - These often balance coverage vs. shape count
- Account for both buckets jointly - If shapes are shared across buckets, optimize globally not independently
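To make the guidelines concrete, the sketch below generates a geometric progression of shapes and measures the prompt padding it causes. The alignment of 64, the function names, and the sample lengths are assumptions for illustration only.

```python
import bisect


def geometric_shapes(min_len, max_len, ratio=2.0, align=64):
    """Shapes growing by `ratio`, rounded up to `align`, always covering max_len."""
    shapes, s = [], float(min_len)
    while s < max_len:
        shapes.append(-(-int(s) // align) * align)  # round up to the alignment
        s *= ratio
    shapes.append(-(-max_len // align) * align)     # final shape covers max_len
    return sorted(set(shapes))


def total_padding(prompt_lens, shapes):
    """Padding tokens when each prompt uses the smallest shape that fits."""
    pad = 0
    for n in prompt_lens:
        pad += shapes[bisect.bisect_left(shapes, n)] - n
    return pad


prompt_lens = [100, 120, 130, 500, 900, 2000]
shapes = geometric_shapes(128, 2000)
padding = total_padding(prompt_lens, shapes)
print(shapes, padding)
```

When shapes are shared across buckets, `total_padding` can be summed over every bucket's lengths for the same candidate shape list, so the comparison stays global.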
Generation Length Bucketing
The gen_len bucketing parameter significantly impacts padding ratio:
- Smaller buckets = Lower padding ratio but more batches (higher cost)
- Larger buckets = Fewer batches but higher padding from variance in gen_len
- Find the sweet spot by computing the padding budget and working backwards
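The trade-off can be measured directly for candidate bucket sizes. The sketch assumes each request's gen_len is padded up to the top of its bucket and that requests sharing a bucket are packed into batches of at most `max_batch`; both rules and the sample data are hypothetical.

```python
from collections import Counter


def gen_bucket_tradeoff(gen_lens, bucket_size, max_batch=8):
    """Return (padding_ratio, num_batches) for one bucket size."""
    # Round every gen_len up to the next multiple of bucket_size.
    padded_lens = [-(-g // bucket_size) * bucket_size for g in gen_lens]
    padding_ratio = (sum(padded_lens) - sum(gen_lens)) / sum(padded_lens)
    per_bucket = Counter(padded_lens)
    num_batches = sum(-(-n // max_batch) for n in per_bucket.values())
    return padding_ratio, num_batches


gen_lens = [10, 12, 18, 22, 31, 35, 44, 47]
small = gen_bucket_tradeoff(gen_lens, bucket_size=16)
large = gen_bucket_tradeoff(gen_lens, bucket_size=48)
print("bucket_size=16:", small)
print("bucket_size=48:", large)
```

On this sample the smaller bucket gives a lower padding ratio but three batches instead of one, which is the tension described above.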
Verification Strategies
Early Constraint Checks
Immediately after generating output, verify:
1. All request IDs appear exactly once
2. Number of unique shapes ≤ limit
3. Each request's prompt_len ≤ assigned shape
4. No missing shapes that would cause alignment failures
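A minimal checker for the first three items might look like the following. The data layout (a dict of request_id to prompt_len and a list of (request_id, shape) pairs) and the name `check_plan` are assumptions; adapt them to the actual output format.

```python
def check_plan(requests, assignments, max_shapes=8):
    """Return a list of constraint violations; an empty list means the plan passes.

    `requests` maps request_id -> prompt_len; `assignments` is a list of
    (request_id, shape) pairs. Both layouts are assumptions for illustration.
    """
    errors = []
    ids = [rid for rid, _ in assignments]
    # Every request ID must appear exactly once.
    if sorted(ids) != sorted(requests):
        errors.append("request IDs missing, unknown, or duplicated")
    shapes = {shape for _, shape in assignments}
    if len(shapes) > max_shapes:
        errors.append(f"{len(shapes)} unique shapes exceeds limit {max_shapes}")
    for rid, shape in assignments:
        if rid in requests and requests[rid] > shape:
            errors.append(f"{rid}: prompt_len {requests[rid]} > shape {shape}")
    return errors


requests = {"r1": 100, "r2": 600, "r3": 90}
good = [("r1", 128), ("r2", 1024), ("r3", 128)]
bad = [("r1", 128), ("r2", 512), ("r1", 128)]  # r3 missing, r1 repeated, r2 too long
print(check_plan(requests, good))
print(check_plan(requests, bad))
```

Returning all violations at once, rather than stopping at the first, shortens the fix-and-retry loop.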
Metric Validation
Before considering a solution complete:
- Run the official evaluation script (if provided)
- Compare all metrics against all thresholds
- Check each bucket independently - passing one bucket does not guarantee passing others
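Checking every metric against every threshold, bucket by bucket, is easy to automate. The metric names, limits, and values below are invented for illustration, and the sketch assumes all thresholds are upper bounds.

```python
def failed_thresholds(metrics_by_bucket, thresholds):
    """List every (bucket, metric) pair whose value exceeds its upper limit."""
    failures = []
    for bucket, metrics in metrics_by_bucket.items():
        for name, limit in thresholds[bucket].items():
            if metrics[name] > limit:
                failures.append((bucket, name, metrics[name], limit))
    return failures


# Hypothetical numbers: bucket_1 passes, bucket_2 exceeds its padding limit.
metrics_by_bucket = {
    "bucket_1": {"padding_ratio": 0.04, "p95_latency_ms": 180.0},
    "bucket_2": {"padding_ratio": 0.12, "p95_latency_ms": 150.0},
}
thresholds = {
    "bucket_1": {"padding_ratio": 0.05, "p95_latency_ms": 200.0},
    "bucket_2": {"padding_ratio": 0.05, "p95_latency_ms": 200.0},
}
failures = failed_thresholds(metrics_by_bucket, thresholds)
print(failures)
```

This complements, and does not replace, running the official evaluation script when one is provided.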
Common Verification Failures
Watch for these issues:
- Missing shapes that cause coverage gaps (e.g., shape 2048 missing when needed)
- Single-request batches that waste per-batch overhead
- Shape constraints violated when optimizing buckets independently
Common Pitfalls
Premature Optimization
- Mistake: Jumping into implementation before understanding mathematical constraints
- Fix: Spend time upfront computing exact budgets (e.g., "bucket_1 can tolerate at most 25,735 padding tokens")
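A padding budget of this kind can be derived from the ratio limit. The sketch assumes padding ratio is defined as padding / (real + padding), which may differ from a given task's definition; under that assumption, padding ≤ limit × real / (1 − limit). The token count used is made up.

```python
import math
from fractions import Fraction


def max_padding_tokens(real_tokens, max_padding_ratio):
    """Largest integer P with P / (real_tokens + P) <= max_padding_ratio.

    Assumes padding ratio = padding / (real + padding). Fractions keep the
    arithmetic exact so the floor is not affected by float rounding.
    """
    r = Fraction(str(max_padding_ratio))
    return math.floor(r * real_tokens / (1 - r))


budget = max_padding_tokens(real_tokens=90_000, max_padding_ratio=0.10)
print(budget)
```

With the budget known, candidate shape lists and bucket sizes can be rejected as soon as their padding exceeds it, before any cost evaluation.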
Insufficient Cost Analysis
- Mistake: Not understanding which cost component dominates
- Fix: Compute and document the full cost breakdown before optimizing
Independent Bucket Optimization
- Mistake: Optimizing each bucket separately when constrain
End of content truncated in source
Dependency Matrix
Required Modules
None required
Components
Standard package
Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill: Name: llm-inference-batching-scheduler Download link: https://github.com/letta-ai/skills/archive/main.zip#llm-inference-batching-scheduler Please download this .zip file, extract it, and install it in the .claude/skills/ directory.