Tangram — Per-head KV cache budgets

[Figure: KV importance is non-uniform across heads, so each head's budget is sized to what it keeps. Six heads (Heads 1 and 2 broad, Head 3 mixed, Heads 4 and 5 local, Head 6 sink) each show a different retention pattern over past tokens. Real KV (green) differs per head; a uniform budget pads every head to the widest, and that padding is wasted (pink). Moving from a uniform budget to per-head budgets sized to each head's retention shrinks the KV cache per request from 100% to ~49%, freeing about half (illustrative). Up to 2.6× multi-turn serving throughput, with model accuracy fully preserved. Throughput and accuracy: Tangram paper. Per-head budgets and memory percentages: illustrative.]
learnaivisually.com/ai-explained/tangram-per-head-kv-budgets

The news. On June 4, 2026, researchers posted Tangram, a system for multi-turn LLM serving built on a single observation: KV-cache importance is non-uniform across attention heads, so giving every head the same cache budget wastes memory. Tangram has three parts. Deterministic Budget Allocation fixes each head's KV size from its inherent retention pattern; Head-Group Paging clusters heads with similar needs behind independent vectorized page tables; and Ahead-of-Time Load Balancing plans GPU memory from those static profiles before runtime. The paper reports up to 2.6× serving throughput over existing baselines while fully preserving model accuracy.

Picture boarding a plane where every passenger is handed the same full-size overhead bin, whether they brought a rolling suitcase or a paperback. The heavy travelers fill theirs; the light ones leave a bin that is almost entirely air. The cabin runs out of bin space long before it runs out of seats — not because the bags are big, but because the bins were all sized for the biggest bag. Tangram's whole move is to size each bin to the bag it actually holds, then slide the freed shelf space toward letting more passengers board.

Under the metaphor, each passenger is an attention head and the carry-on is the KV cache that head keeps. Heads are not interchangeable: some attend broadly across the whole conversation and genuinely need a large cache; many are local or act as attention sinks, keeping only the last few tokens or a single anchor. A uniform budget has to be sized for the greediest head, so every light head drags around the same oversized allocation. Because the KV cache is what dominates serving memory, that padding is exactly the resource you can least afford to waste.

| Factor | What it counts | Multiplier |
|---|---|---|
| K and V | Two vectors stored per token (Key + Value) | × 2 |
| Layers | Each layer has its own cache (like 32 filing cabinets) | × 32 |
| Heads | Each attention head stores its own K/V pair | × 32 |
| Head size | Each K or V vector has 128 numbers (d_head) | × 128 |
| Bytes per number | FP16 = 2 bytes per number (half precision) | × 2 |

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
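
The per-token arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `kv_bytes_per_token` and the 4,096-token context length are assumptions chosen for illustration, while the layer, head, and precision figures are the Llama-2 7B values listed above.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_value: int) -> int:
    """KV cache bytes stored for one token across the whole model."""
    # The leading 2 counts one Key vector plus one Value vector
    # per head, per layer.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value


# Llama-2 7B figures from the breakdown above (FP16 = 2 bytes per number).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128,
                               bytes_per_value=2)
print(per_token)          # 524288 bytes
print(per_token // 1024)  # 512 KiB

# Hypothetical context length, only to show how the cache scales with tokens.
context_tokens = 4096
print(per_token * context_tokens / 2**30)  # 2.0 GiB for a single request
```

At half a megabyte per token, one long conversation already occupies gigabytes, which is why the cache rather than the weights tends to cap the batch.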

Tangram replaces the one-size budget with a deterministic per-head budget read off each head's retention pattern — the same way grouped-query attention already exploits the fact that heads can share keys and values rather than each carrying a full copy. It then clusters heads with similar budgets behind shared page tables, and because every budget is fixed up front, it can plan GPU memory ahead of time instead of scheduling it on the fly. The freed memory is not the goal in itself — it is headroom: smaller per-request caches mean more requests fit before the cache caps the batch, and a bigger batch is what turns into throughput.
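
One way to picture grouping heads with similar budgets is to round each head's budget up to whole pages and bucket heads that need the same page count. The sketch below is an illustration of that idea only, not Tangram's actual Head-Group Paging algorithm: the page size, the per-head token budgets, and the function name `group_heads_by_pages` are all hypothetical.

```python
import math
from collections import defaultdict

PAGE_TOKENS = 16  # hypothetical page size, in tokens


def group_heads_by_pages(budgets, page_tokens=PAGE_TOKENS):
    """Bucket head indices by the number of pages each head's budget needs."""
    groups = defaultdict(list)
    for head, budget in enumerate(budgets):
        pages = math.ceil(budget / page_tokens)  # round up to whole pages
        groups[pages].append(head)
    return dict(groups)


# Hypothetical per-head token budgets for one 6-head layer.
budgets = [95, 82, 55, 30, 18, 12]
groups = group_heads_by_pages(budgets)
print(groups)  # {6: [0, 1], 4: [2], 2: [3, 4], 1: [5]}

# Pages used when each group gets only what it needs, versus padding
# every head to the page count of the largest head.
per_head_pages = sum(pages * len(heads) for pages, heads in groups.items())
uniform_pages = len(budgets) * math.ceil(max(budgets) / PAGE_TOKENS)
print(per_head_pages, uniform_pages)  # 21 36
```

Because the budgets are fixed before any request arrives, a grouping like this can be computed once offline, which is what makes ahead-of-time memory planning possible.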

[Figure: attention variants and the KV pairs each stores. MHA (standard): 8 KV pairs, full size. MQA: 1 KV pair, 8× smaller. GQA (modern): 2 KV pairs, 4× smaller.]
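
The shrink factors for the three attention variants follow from how many query heads share each KV pair. A minimal sketch, assuming 8 query heads (implied by MHA's 8 KV pairs) and a hypothetical helper named `shrink_factor`:

```python
def shrink_factor(n_query_heads: int, n_kv_heads: int) -> int:
    """How many times smaller the KV cache is than full multi-head attention."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly among KV heads")
    return n_query_heads // n_kv_heads


N_QUERY_HEADS = 8  # assumed: MHA stores one KV pair per query head

# KV pairs stored by each variant.
variants = {"MHA": 8, "MQA": 1, "GQA": 2}
for name, n_kv_heads in variants.items():
    factor = shrink_factor(N_QUERY_HEADS, n_kv_heads)
    print(f"{name}: {n_kv_heads} KV pairs, {factor}x smaller than MHA")
```

Sharing changes how many KV pairs exist; per-head budgets change how many tokens each remaining pair keeps, so the two savings address different dimensions of the cache.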

Here is where it earns its keep, with illustrative numbers. Take a layer with 6 attention heads. A uniform budget sizes every head to the most demanding one, say 100 units of KV each, so the layer holds 6 × 100 = 600 units per request. Now measure what each head actually keeps: one broad head needs ~95, the other ~82, the mixed head ~55, and the three local/sink heads only ~30, ~18, and ~12. Those sum to 292 units, so the uniform budget spends 600 to do the work of 292, leaving roughly 51% of the KV cache as padding. Reclaim that padding and about twice as many requests' caches (600 / 292 ≈ 2.05×) fit in the same HBM, which is the kind of headroom behind the reported up-to-2.6× throughput.
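
The same arithmetic as a short script. The numbers are the illustrative ones from the paragraph above, not measurements from the Tangram paper, and the dictionary keys are made-up labels.

```python
UNIFORM_BUDGET = 100  # illustrative units of KV given to every head

# Illustrative units of KV each head actually keeps.
kept = {
    "head1_broad": 95,
    "head2_broad": 82,
    "head3_mixed": 55,
    "head4_local": 30,
    "head5_local": 18,
    "head6_sink": 12,
}

uniform_total = UNIFORM_BUDGET * len(kept)  # every head padded to the same size
per_head_total = sum(kept.values())         # each head sized to what it keeps

padding_fraction = 1 - per_head_total / uniform_total
capacity_gain = uniform_total / per_head_total

print(uniform_total, per_head_total)            # 600 292
print(f"padding: {padding_fraction:.1%}")       # padding: 51.3%
print(f"capacity gain: {capacity_gain:.2f}x")   # capacity gain: 2.05x
```

The capacity gain is a memory ratio, not a throughput figure; how much of it turns into throughput depends on the serving setup.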

| Aspect | Uniform per-head budget | Tangram (per-head) |
|---|---|---|
| Cache size per head | same for all, sized to the greediest head | deterministic, sized to each head's retention pattern |
| Wasted KV on light heads | large: the full budget minus what the head keeps | ~half the KV freed overall (illustrative, setup-dependent) |
| Memory management | one shared layout, dynamically scheduled | head groups behind independent vectorized page tables |
| GPU memory planning | at runtime, per step | ahead-of-time from static budget profiles |
| Multi-turn throughput | baseline | up to 2.6× (Tangram paper) |
| Model accuracy | | fully preserved (Tangram paper) |

The catch is that a per-head budget is only safe if you can tell, ahead of time, how much each head truly needs — guess too small and you evict KV a broad head still wants, hurting quality. Tangram's claim is that a head's retention pattern is inherent and stable enough to fix the budget deterministically, which is what lets it reclaim the memory and keep accuracy. That makes it complementary to the other ways of shrinking the cache: quantizing each KV entry to fewer bits, or pruning which tokens to keep, attack a different axis than how big each head's budget should be in the first place.


Frequently Asked Questions