Tangram — Per-head KV cache budgets
The news. On June 4, 2026, researchers posted Tangram, a system for multi-turn LLM serving built on a single observation: KV-cache importance is non-uniform across attention heads, so giving every head the same cache budget wastes memory. Tangram has three parts: Deterministic Budget Allocation fixes each head's KV size from its inherent retention pattern; Head-Group Paging clusters heads with similar needs behind independent vectorized page tables; and Ahead-of-Time Load Balancing plans GPU memory from those static profiles before runtime. It reports up to 2.6× serving throughput over existing baselines while fully preserving model accuracy.
Picture boarding a plane where every passenger is handed the same full-size overhead bin, whether they brought a rolling suitcase or a paperback. The heavy travelers fill theirs; the light ones leave a bin that is almost entirely air. The cabin runs out of bin space long before it runs out of seats — not because the bags are big, but because the bins were all sized for the biggest bag. Tangram's whole move is to size each bin to the bag it actually holds, then slide the freed shelf space toward letting more passengers board.
Under the metaphor, each passenger is an attention head and the carry-on is the KV cache that head keeps. Heads are not interchangeable: some attend broadly across the whole conversation and genuinely need a large cache; many are local or act as attention sinks, keeping only the last few tokens or a single anchor. A uniform budget has to be sized for the greediest head, so every light head drags around the same oversized allocation. Because the KV cache is what dominates serving memory, that padding is exactly the resource you can least afford to waste.
Tangram replaces the one-size budget with a deterministic per-head budget read off each head's retention pattern. The idea is in the same spirit as grouped-query attention, which already exploits redundancy across heads by letting several query heads share one set of keys and values instead of each carrying a full copy. Tangram then clusters heads with similar budgets behind shared page tables, and because every budget is fixed up front, it can plan GPU memory ahead of time instead of scheduling it on the fly. The freed memory is not the goal in itself; it is headroom. Smaller per-request caches mean more requests fit before the cache caps the batch, and a bigger batch is what turns into throughput.
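To make the grouping step concrete, here is a minimal sketch, not the paper's algorithm. It assumes budgets are measured in tokens, that cache memory is handed out in fixed-size pages, and that heads needing the same number of pages can share one group. The page size, the budgets, and the function name `group_heads_by_pages` are all hypothetical.

```python
import math
from collections import defaultdict

PAGE_TOKENS = 16  # hypothetical page size, in tokens per page


def group_heads_by_pages(budgets, page_tokens=PAGE_TOKENS):
    """Group head indices by how many pages their fixed budget rounds up to.

    This is a simplification: heads whose budgets need the same page count
    land in the same group, so each group could sit behind one page table.
    """
    groups = defaultdict(list)
    for head, budget in enumerate(budgets):
        pages_needed = math.ceil(budget / page_tokens)
        groups[pages_needed].append(head)
    return dict(groups)


# Illustrative per-head budgets (tokens); not taken from the paper.
budgets = [95, 82, 55, 30, 18, 12]
groups = group_heads_by_pages(budgets)

# Pages one request needs under per-head budgets vs. a uniform 100-token budget.
per_head_pages = sum(pages * len(heads) for pages, heads in groups.items())
uniform_pages = math.ceil(100 / PAGE_TOKENS) * len(budgets)

print(groups)                         # {6: [0, 1], 4: [2], 2: [3, 4], 1: [5]}
print(per_head_pages, uniform_pages)  # 21 42
```

Because the budgets are fixed before serving starts, a grouping like this can be computed once per model rather than per request, which is what makes ahead-of-time planning possible in principle.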
Here is where it earns its keep, with illustrative numbers. Take a layer with 6 attention heads. A uniform budget sizes every head to the most demanding one, say 100 units of KV each, so the layer holds 6 × 100 = 600 units per request. Now measure what each head actually keeps: a broad head needs ~95, the next ~82, a mixed head ~55, and the three local/sink heads only ~30, ~18, and ~12. Those sum to 292 units, so the uniform budget spends 600 to do the work of 292, leaving roughly 51% of the KV cache as padding. Reclaim that padding and about twice as many requests' caches (600 / 292 ≈ 2.05×) fit in the same HBM. That extra batch capacity is the kind of headroom behind the reported throughput gain of up to 2.6×.
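The arithmetic in the six-head example can be checked in a few lines. This is a sketch using only the illustrative numbers from the text; the memory pool size `POOL_UNITS` is a hypothetical value chosen to show how the freed padding turns into batch capacity.

```python
def compare_budgets(per_head_need, uniform_budget):
    """Return (uniform total, per-head total, fraction of the uniform cache that is padding)."""
    uniform_total = uniform_budget * len(per_head_need)
    per_head_total = sum(per_head_need)
    padding_fraction = 1 - per_head_total / uniform_total
    return uniform_total, per_head_total, padding_fraction


def requests_that_fit(pool_units, units_per_request):
    """How many whole requests' caches fit in a fixed memory pool."""
    return pool_units // units_per_request


needs = [95, 82, 55, 30, 18, 12]  # illustrative per-head KV needs from the text
uniform_total, per_head_total, padding = compare_budgets(needs, uniform_budget=100)

POOL_UNITS = 6000  # hypothetical pool size, in the same arbitrary units
fit_uniform = requests_that_fit(POOL_UNITS, uniform_total)
fit_per_head = requests_that_fit(POOL_UNITS, per_head_total)

print(uniform_total, per_head_total, f"{padding:.1%}")  # 600 292 51.3%
print(fit_uniform, fit_per_head)                        # 10 20
```

With this pool, the uniform budget caps the batch at 10 requests while per-head budgets admit 20, which is the capacity doubling described above.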
| Aspect | Uniform per-head budget | Tangram (per-head) |
|---|---|---|
| Cache size per head | same for all, sized to the greediest head | deterministic, sized to each head's retention pattern |
| Wasted KV on light heads | large — the full budget minus what the head keeps | ~half the KV freed overall (illustrative, setup-dependent) |
| Memory management | one shared layout, dynamically scheduled | head groups behind independent vectorized page tables |
| GPU memory planning | at runtime, per step | ahead-of-time from static budget profiles |
| Multi-turn throughput | baseline | up to 2.6× (Tangram paper) |
| Model accuracy | — | fully preserved (Tangram paper) |
The catch is that a per-head budget is only safe if you can tell, ahead of time, how much each head truly needs — guess too small and you evict KV a broad head still wants, hurting quality. Tangram's claim is that a head's retention pattern is inherent and stable enough to fix the budget deterministically, which is what lets it reclaim the memory and keep accuracy. That makes it complementary to the other ways of shrinking the cache: quantizing each KV entry to fewer bits, or pruning which tokens to keep, attack a different axis than how big each head's budget should be in the first place.
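One way to see why these axes are independent is to write the cache size as a product. The sketch below assumes each kept token stores one key vector and one value vector per head, treats the earlier illustrative "units" as tokens, and ignores any metadata a real quantization scheme would add. The head dimension of 128 and the bit widths are assumptions, not figures from the Tangram paper.

```python
def kv_bytes(tokens_per_head, head_dim, bits_per_value):
    """Approximate KV bytes for one layer of one request.

    tokens_per_head: tokens kept per head (per-head budgets and pruning change this)
    head_dim:        values per key or value vector
    bits_per_value:  storage width of each value (quantization changes this)
    """
    total_values = 2 * sum(tokens_per_head) * head_dim  # 2 = one key + one value vector
    return total_values * bits_per_value // 8


HEAD_DIM = 128  # assumed head dimension

uniform = kv_bytes([100] * 6, HEAD_DIM, bits_per_value=16)
per_head = kv_bytes([95, 82, 55, 30, 18, 12], HEAD_DIM, bits_per_value=16)
per_head_4bit = kv_bytes([95, 82, 55, 30, 18, 12], HEAD_DIM, bits_per_value=4)

print(uniform, per_head, per_head_4bit)  # 307200 149504 37376
```

Shrinking `tokens_per_head` and shrinking `bits_per_value` multiply together in this simplified model, which is why a per-head budget can in principle stack with quantization rather than compete with it.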
Related explainers
- SP-KV — self-pruned KV cache — the orthogonal axis: deciding which tokens to drop from the cache, not how big each head's budget is
- KVARN — Hadamard 2-bit KV cache — shrinking the cache by storing each KV entry in fewer bits rather than fewer entries
- SGLang v0.5.12 — MLA token-speed — multi-head latent attention, another route to a smaller per-head KV footprint