What problem does it solve?

This skill provides guidance for solving LLM inference batching and scheduling optimization problems, where requests must be grouped into batches while minimizing cost, padding waste, and latency.

Problem Understanding

Before implementation, thoroughly analyze the problem structure:

Constraint Analysis

Identify all hard constraints - Extract exact limits for:
- Maximum unique shapes allowed (e.g., ≤ 8 shapes across all buckets)
- Latency thresholds (P95, P99)
- Cost budget thresholds
- Padding ratio limits

Compute hard bounds early - Before coding, calculate:
- Minimum possible padding from alignment requirements
- Minimum number of batches required for coverage
- Maximum achievable efficiency given constraints

Decompose the cost function - Understand each component:
- Per-batch overhead (fixed cost per batch)
- Shape compilation costs (often quadratic in sequence length)
- Prefill/decode costs (variable per request)
- Document as: Cost ≈ overhead × num_batches + shape_compile_cost + prefill_cost + decode_cost
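
The decomposition above can be sketched as a small cost model. This is only an illustration: the `Batch` type, the `estimate_cost` function, and every coefficient are hypothetical placeholders, and the quadratic compile term is an assumption based on the "often quadratic" note above, not a value taken from any real evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    shape: int           # padded sequence length used by this batch
    prompt_tokens: int   # total padded prompt tokens in the batch
    decode_tokens: int   # total padded generated tokens in the batch


def estimate_cost(batches, batch_overhead=100.0, compile_coeff=1e-3,
                  prefill_coeff=0.01, decode_coeff=0.02):
    """Cost ~ overhead * num_batches + shape_compile + prefill + decode.

    All coefficients are placeholders, not values from a real cost model.
    """
    overhead = batch_overhead * len(batches)
    # Each unique shape is compiled once; assumed quadratic in sequence length.
    compile_cost = sum(compile_coeff * s * s for s in {b.shape for b in batches})
    prefill = prefill_coeff * sum(b.prompt_tokens for b in batches)
    decode = decode_coeff * sum(b.decode_tokens for b in batches)
    return {"overhead": overhead, "compile": compile_cost,
            "prefill": prefill, "decode": decode,
            "total": overhead + compile_cost + prefill + decode}


example = [Batch(512, 2048, 400), Batch(512, 1536, 300), Batch(1024, 4096, 800)]
breakdown = estimate_cost(example)
```

Returning the breakdown rather than a single number makes it easier to see which component dominates before tuning anything.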

Data Analysis

Profile the request distribution - Examine:
- Distribution of prompt lengths (prompt_len)
- Distribution of generation lengths (gen_len)
- Outliers that may disproportionately impact metrics

Verify coverage requirements - Ensure:
- The largest prompt_len in each bucket is covered by chosen shapes
- Edge cases with extreme gen_len values are handled
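
A minimal profiling sketch for the checks above, using only the standard library. The helper names, the sample lengths, and the "3 × median" outlier rule are illustrative assumptions; a real task may call for a different outlier definition.

```python
import statistics


def profile_lengths(values):
    """Summarize a list of lengths: min, median, P95, max."""
    ordered = sorted(values)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20, method="inclusive")[-1]
    return {"min": ordered[0], "median": statistics.median(ordered),
            "p95": p95, "max": ordered[-1]}


def find_outliers(values, factor=3.0):
    """Flag values above factor * median (an illustrative rule, not a standard)."""
    median = statistics.median(values)
    return [v for v in values if v > factor * median]


# Made-up prompt lengths with one extreme request.
prompt_lens = [120, 130, 150, 160, 180, 200, 210, 256, 300, 4000]
summary = profile_lengths(prompt_lens)
outliers = find_outliers(prompt_lens)
```

The `max` entry of the summary is the value the largest chosen shape has to cover.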

Implementation Approach

Build Reusable Evaluation Infrastructure

Before iterating on parameters, create a systematic evaluation harness:

1. Write a function that takes parameters (shape_list, gen_bucket_sizes) and returns all metrics
2. Include automatic constraint verification with assertions
3. Enable rapid parameter comparison without manual re-runs
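
One possible shape for such a harness is sketched below. The request tuple layout, the batching rule (smallest covering shape plus fixed-width gen_len buckets), and the padding ratio definition (padding tokens divided by real tokens) are all assumptions made for illustration; the task's own evaluator defines the real ones.

```python
import bisect
from collections import defaultdict


def evaluate(requests, shape_list, gen_bucket_size, max_shapes=8):
    """Return metrics for one parameter setting.

    requests: list of (request_id, prompt_len, gen_len) tuples.
    """
    shapes = sorted(set(shape_list))
    # Constraint checks run on every call, so bad settings fail fast.
    assert len(shapes) <= max_shapes, "too many unique shapes"
    assert max(p for _, p, _ in requests) <= shapes[-1], "largest prompt not covered"

    batches = defaultdict(list)
    for rid, prompt_len, gen_len in requests:
        # Smallest shape that is >= prompt_len.
        shape = shapes[bisect.bisect_left(shapes, prompt_len)]
        batches[(shape, gen_len // gen_bucket_size)].append((rid, prompt_len, gen_len))

    real = padded = 0
    for (shape, _), members in batches.items():
        max_gen = max(g for _, _, g in members)
        real += sum(p + g for _, p, g in members)
        padded += len(members) * (shape + max_gen)

    return {"num_batches": len(batches),
            "num_shapes": len({shape for shape, _ in batches}),
            "padding_ratio": (padded - real) / real}


requests = [("a", 100, 20), ("b", 120, 25), ("c", 500, 40), ("d", 510, 90)]
metrics = evaluate(requests, shape_list=[128, 512], gen_bucket_size=32)
```

Because the function is pure, two parameter settings can be compared by calling it twice on the same request list.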

Parameter Search Strategy

Avoid random trial-and-error. Instead:

Grid search for small parameter spaces - When parameters are bounded (e.g., gen_bucket_size in [15, 50]), systematically evaluate combinations
Binary search for single parameters - When optimizing one parameter while holding others fixed, and the metric changes monotonically with that parameter, use binary search to find the threshold value
Document the optimization landscape - Track which parameter combinations produce which metric values to understand trade-offs
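
A grid search that also records the landscape might look like the sketch below. `grid_search` and `toy_evaluate` are hypothetical names, and the toy metric formulas and the 0.12 padding limit are invented purely so the example runs; in practice `evaluate_fn` would be the real evaluation harness.

```python
import itertools


def grid_search(evaluate_fn, grid, is_feasible, objective):
    """Evaluate every combination in `grid`; return (best, log of all results)."""
    names = list(grid)
    log = []
    for combo in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        log.append((params, evaluate_fn(**params)))
    feasible = [(p, m) for p, m in log if is_feasible(m)]
    best = min(feasible, key=lambda pm: objective(pm[1])) if feasible else None
    return best, log


# Toy stand-in for a real evaluation harness (illustrative formulas only).
def toy_evaluate(gen_bucket_size, num_shapes):
    return {"cost": 1000 / gen_bucket_size + 50 * num_shapes,
            "padding_ratio": gen_bucket_size / 400 + 0.2 / num_shapes}


best, log = grid_search(
    toy_evaluate,
    grid={"gen_bucket_size": range(15, 51, 5), "num_shapes": [4, 6, 8]},
    is_feasible=lambda m: m["padding_ratio"] <= 0.12,
    objective=lambda m: m["cost"],
)
```

Keeping `log` alongside `best` preserves every evaluated combination, so trade-offs can be inspected afterwards without re-running anything.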

Shape Selection Guidelines

When selecting shapes for sequence length bucketing:

Analyze the length distribution - Choose shapes that minimize padding for the most common lengths
Consider power-of-two or geometric progressions - These often balance coverage vs. shape count
Account for both buckets jointly - If shapes are shared across buckets, optimize globally not independently
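
As a starting point for the geometric option, the sketch below generates a shape list between a minimum and maximum length. The function name, the alignment multiple of 64, and the example bounds are assumptions; the result is a candidate to refine against the actual length distribution, not an optimal answer.

```python
import math


def geometric_shapes(min_len, max_len, max_shapes=8, align=64):
    """Return up to `max_shapes` aligned shapes, the largest covering max_len."""
    def round_up(x):
        return math.ceil(x / align) * align

    if max_shapes == 1 or min_len >= max_len:
        return [round_up(max_len)]
    ratio = (max_len / min_len) ** (1 / (max_shapes - 1))
    # round() absorbs float error before aligning; the last shape is pinned
    # to max_len so coverage never depends on floating-point rounding.
    raw = [round(min_len * ratio ** i) for i in range(max_shapes - 1)] + [max_len]
    return sorted({round_up(x) for x in raw})


shapes = geometric_shapes(min_len=64, max_len=2048, max_shapes=6)
```

With these inputs the ratio is 2, so the result is the power-of-two sequence from 64 to 2048. Aligned duplicates are merged, which means the function can return fewer shapes than the limit.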

Generation Length Bucketing

The gen_len bucketing parameter significantly impacts padding ratio:

Smaller buckets = Lower padding ratio but more batches (higher cost)
Larger buckets = Fewer batches but higher padding from variance in gen_len
Find the sweet spot by computing the padding budget and working backwards
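
The trade-off can be made concrete with a few lines of code. The sketch below assumes requests are grouped by `gen_len // bucket_size` and that each group is padded to its longest member; both the grouping rule and the sample lengths are illustrative.

```python
from collections import defaultdict


def gen_bucket_tradeoff(gen_lens, bucket_size):
    """Return (num_batches, padding_tokens) when grouping by gen_len // bucket_size.

    Every request in a group is padded to the group's longest gen_len.
    """
    groups = defaultdict(list)
    for g in gen_lens:
        groups[g // bucket_size].append(g)
    padding = sum(max(members) * len(members) - sum(members)
                  for members in groups.values())
    return len(groups), padding


gen_lens = [5, 12, 18, 27, 33, 41, 48, 64, 70, 95]
tradeoff = {size: gen_bucket_tradeoff(gen_lens, size) for size in (16, 32, 64)}
# tradeoff == {16: (6, 30), 32: (3, 124), 64: (2, 208)}
```

On this sample, widening the bucket from 16 to 64 cuts the batch count from 6 to 2 while padding grows from 30 to 208 tokens.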

Verification Strategies

Early Constraint Checks

Immediately after generating output, verify:

1. All request IDs appear exactly once
2. Number of unique shapes ≤ limit
3. Each request's prompt_len ≤ assigned shape
4. No missing shapes that would cause alignment failures
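
Checks 1 through 3 above can be sketched as a single function that returns a list of violations. The data layout (`requests` as a dict, `assignments` as pairs) is an assumption about the plan format; check 4 depends on task-specific alignment rules and is left out.

```python
from collections import Counter


def check_plan(requests, assignments, max_shapes=8):
    """Return a list of violation messages; an empty list means the plan passes.

    requests: {request_id: prompt_len}
    assignments: list of (request_id, shape) pairs from the generated plan.
    """
    problems = []
    counts = Counter(rid for rid, _ in assignments)
    missing = sorted(set(requests) - set(counts))
    repeated = sorted(rid for rid, n in counts.items() if n > 1)
    if missing:
        problems.append(f"missing request IDs: {missing}")
    if repeated:
        problems.append(f"request IDs assigned more than once: {repeated}")
    shapes = {shape for _, shape in assignments}
    if len(shapes) > max_shapes:
        problems.append(f"{len(shapes)} unique shapes exceeds limit {max_shapes}")
    for rid, shape in assignments:
        if requests.get(rid, 0) > shape:
            problems.append(f"request {rid}: prompt_len {requests[rid]} > shape {shape}")
    return problems


requests = {"a": 100, "b": 600}
good_plan = [("a", 128), ("b", 1024)]
bad_plan = [("a", 128), ("a", 128), ("b", 512)]  # duplicate "a", "b" not covered
good_problems = check_plan(requests, good_plan)
bad_problems = check_plan(requests, bad_plan)
```

Collecting messages instead of raising on the first failure shows every violation in one pass.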

Metric Validation

Before considering a solution complete:

Run the official evaluation script (if provided)
Compare all metrics against all thresholds
Check each bucket independently - passing one bucket does not guarantee passing others
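
A per-bucket comparison can be sketched as follows. It assumes every threshold is an upper bound and that the same limits apply to each bucket; the metric names and numbers are made up for illustration and do not come from any official script.

```python
def failed_thresholds(metrics_by_bucket, thresholds):
    """Return {bucket: {metric: (value, limit)}} for every metric above its limit."""
    failures = {}
    for bucket, metrics in metrics_by_bucket.items():
        over = {name: (metrics[name], limit)
                for name, limit in thresholds.items()
                if metrics[name] > limit}
        if over:
            failures[bucket] = over
    return failures


# Hypothetical metrics: bucket_1 passes everything, bucket_2 fails padding.
metrics_by_bucket = {
    "bucket_1": {"cost": 9.0e5, "padding_ratio": 0.04, "p95_latency": 1.8},
    "bucket_2": {"cost": 7.5e5, "padding_ratio": 0.09, "p95_latency": 1.2},
}
thresholds = {"cost": 1.0e6, "padding_ratio": 0.05, "p95_latency": 2.0}
failures = failed_thresholds(metrics_by_bucket, thresholds)
```

An empty result means every bucket passed every threshold; anything else names the bucket and metric to fix.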

Common Verification Failures

Watch for these issues:

Missing shapes that cause coverage gaps (e.g., shape 2048 missing when needed)
Single-request batches that waste per-batch overhead
Shape constraints violated when optimizing buckets independently

Common Pitfalls

Premature Optimization

Mistake: Jumping into implementation before understanding mathematical constraints
Fix: Spend time upfront computing exact budgets (e.g., "bucket_1 can tolerate at most 25,735 padding tokens")
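
Budgets of this kind take only a few lines to compute. The sketch below assumes the padding ratio is defined as padding tokens divided by real tokens and that there is a maximum batch size; both are assumptions to check against the task statement, and the input numbers are illustrative.

```python
import math
from fractions import Fraction


def padding_budget(real_tokens, max_padding_ratio):
    """Largest padding token count with padding / real_tokens <= max_padding_ratio."""
    # Fraction(str(...)) keeps the arithmetic exact, avoiding float rounding.
    return math.floor(real_tokens * Fraction(str(max_padding_ratio)))


def min_batches(num_requests, max_batch_size):
    """Lower bound on the batch count if a batch holds at most max_batch_size requests."""
    return math.ceil(num_requests / max_batch_size)


budget = padding_budget(real_tokens=200_000, max_padding_ratio=0.05)
lower_bound = min_batches(num_requests=1000, max_batch_size=64)
```

Any candidate plan whose padding exceeds `budget` can be rejected before running a full evaluation.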

llm-inference-batching-scheduler

