What is a per-head KV cache budget?

It is a separate KV-cache size for each attention head, instead of one budget applied to every head. Tangram sets each head's budget deterministically from its retention pattern — how much of the past that head actually attends to — so broad heads keep a large cache while local or attention-sink heads keep a small one. The result is far less wasted cache than sizing every head to the most demanding one.

Why does it matter for multi-turn serving?

In multi-turn conversations the KV cache keeps growing and dominates GPU memory and bandwidth, which caps how many requests can run at once. By right-sizing each head's cache, Tangram frees a large fraction of that memory (illustratively about half), so more requests' caches fit in the same HBM. A bigger concurrent batch is what produces the reported up to 2.6× throughput, which the paper says comes with model accuracy fully preserved.

How does Tangram relate to KV cache quantization or pruning?

They attack different axes and stack. Quantization stores each KV entry in fewer bits; token pruning chooses which past tokens to keep; Tangram instead decides how big each head's budget should be and clusters heads with similar budgets behind shared page tables. Because per-head budgeting is about allocation rather than per-entry precision or token selection, it is complementary to both: you can right-size budgets and still quantize or prune within them.
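As a rough illustration of how the two axes multiply, the sketch below estimates KV bytes for one layer under a uniform budget, a per-head budget, and a per-head budget combined with 4-bit storage. The head dimension, the bit widths, and the helper name `kv_bytes` are assumptions made for this example; the per-head token counts reuse the illustrative figures from the article, not measurements from the paper.

```python
def kv_bytes(tokens_per_head, head_dim=128, bits=16):
    """Approximate KV bytes: tokens x head_dim x 2 (keys and values) x bits / 8.

    head_dim and bits are assumed values for illustration only.
    """
    return sum(tokens_per_head) * head_dim * 2 * bits // 8

uniform = [100] * 6                    # every head sized to the greediest one
per_head = [95, 82, 55, 30, 18, 12]    # illustrative per-head budgets

fp16_uniform = kv_bytes(uniform, bits=16)
fp16_per_head = kv_bytes(per_head, bits=16)
int4_per_head = kv_bytes(per_head, bits=4)   # budgeting and quantization stacked

print(fp16_uniform, fp16_per_head, int4_per_head)
print(f"combined reduction: {fp16_uniform / int4_per_head:.2f}x")
```

Under these assumptions the budget cut and the bit-width cut compose multiplicatively, which is the sense in which the techniques stack.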

Tangram speeds multi-turn serving up to 2.6× — Per-head KV cache budgets

learnaivisually.com/ai-explained/tangram-per-head-kv-budgets

Jargon

KV cache: The stored keys and values for every past token, so the model never re-computes attention over the prompt on each new token. It grows with sequence length and is the memory that dominates long, multi-turn serving.
Attention head: One of the parallel attention sub-units in a layer. Each head learns its own pattern of what to look back at — and, crucially for Tangram, different heads keep very different amounts of the past.
Retention pattern: How far back a given head actually attends. Some heads spread attention broadly across the whole history; others are local (recent tokens only) or behave as attention sinks (a couple of anchor tokens). That spread is what makes a uniform budget wasteful.
Page table / paged attention: The KV cache is split into fixed blocks tracked by a page table, like virtual memory. Tangram clusters heads with similar budgets behind independent vectorized page tables so each group is managed together.
Multi-turn serving: Serving a back-and-forth conversation where the cache (and any reused prefix) keeps growing across turns — the regime where KV memory pressure bites hardest.
Ahead-of-time (AoT) load balancing: Because each head's budget is fixed and known in advance, Tangram can plan GPU memory placement before runtime instead of scheduling it dynamically — which the paper says removes per-step scheduling overhead.
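To make the page-table entry above more concrete, here is a minimal sketch of clustering heads by how many fixed-size blocks their budget needs, then giving each cluster its own table of block ids. Everything in it is hypothetical: the block size, the function names, treating the article's illustrative budget units as tokens, and the grouping rule itself, which is a simplification and not Tangram's actual Head-Group Paging.

```python
import math
from collections import defaultdict

BLOCK_TOKENS = 16  # assumed page size in tokens (hypothetical)

def blocks_needed(budget_tokens: int) -> int:
    """Number of fixed-size blocks needed to hold a head's budget."""
    return math.ceil(budget_tokens / BLOCK_TOKENS)

def group_heads(budgets: dict) -> dict:
    """Cluster head ids by block count, so each group can share one page table."""
    groups = defaultdict(list)
    for head, budget in sorted(budgets.items()):
        groups[blocks_needed(budget)].append(head)
    return dict(groups)

def build_page_tables(groups: dict) -> dict:
    """Give each group its own table: logical block index -> physical block id."""
    tables, next_free = {}, 0
    for n_blocks in sorted(groups):
        tables[n_blocks] = list(range(next_free, next_free + n_blocks))
        next_free += n_blocks
    return tables

# Illustrative per-head budgets, keyed by head id.
budgets = {0: 95, 1: 82, 2: 55, 3: 30, 4: 18, 5: 12}
groups = group_heads(budgets)
tables = build_page_tables(groups)
for n_blocks, heads in groups.items():
    print(f"heads {heads}: {n_blocks} blocks, page table {tables[n_blocks]}")
```

Heads that land in the same group have the same block count, so one table lookup can serve the whole group, while a light group never pays for the blocks a broad group needs.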

The news. On June 4, 2026, researchers posted Tangram, a system for multi-turn LLM serving built on a single observation: KV-cache importance is non-uniform across attention heads, so giving every head the same cache budget wastes memory. Tangram has three parts: Deterministic Budget Allocation fixes each head's KV size from its inherent retention pattern; Head-Group Paging clusters heads with similar needs behind independent vectorized page tables; and Ahead-of-Time Load Balancing plans GPU memory from those static profiles before runtime. It reports up to 2.6× serving throughput over existing baselines while fully preserving model accuracy.

Picture boarding a plane where every passenger is handed the same full-size overhead bin, whether they brought a rolling suitcase or a paperback. The heavy travelers fill theirs; the light ones leave a bin that is almost entirely air. The cabin runs out of bin space long before it runs out of seats — not because the bags are big, but because the bins were all sized for the biggest bag. Tangram's whole move is to size each bin to the bag it actually holds, then slide the freed shelf space toward letting more passengers board.

Under the metaphor, each passenger is an attention head and the carry-on is the KV cache that head keeps. Heads are not interchangeable: some attend broadly across the whole conversation and genuinely need a large cache; many are local or act as attention sinks, keeping only the last few tokens or a single anchor. A uniform budget has to be sized for the greediest head, so every light head drags around the same oversized allocation. Because the KV cache is what dominates serving memory, that padding is exactly the resource you can least afford to waste.

Tangram replaces the one-size budget with a deterministic per-head budget read off each head's retention pattern. The idea is in the same spirit as grouped-query attention, which already exploits the fact that heads do not each need a full, separate copy of keys and values. Tangram then clusters heads with similar budgets into groups, each behind its own vectorized page table, and because every budget is fixed up front, it can plan GPU memory ahead of time instead of scheduling it on the fly. The freed memory is not the goal in itself; it is headroom. Smaller per-request caches mean more requests fit before the cache caps the batch, and a bigger batch is what turns into throughput.
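Because the budgets are static, placement can be computed once, offline. The sketch below shows one way that planning could look, using a greedy "largest budget first, least-loaded GPU next" heuristic. This heuristic, the function name `plan_placement`, and the idea of assigning individual heads to GPUs are all assumptions for illustration; the article does not describe the paper's actual load-balancing algorithm.

```python
import heapq

def plan_placement(head_budgets: dict, num_gpus: int):
    """Greedy offline plan: biggest budgets first, each onto the least-loaded GPU."""
    heap = [(0, gpu) for gpu in range(num_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for head, budget in sorted(head_budgets.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)            # least-loaded GPU so far
        placement[gpu].append(head)
        heapq.heappush(heap, (load + budget, gpu))
    loads = {gpu: sum(head_budgets[h] for h in heads)
             for gpu, heads in placement.items()}
    return placement, loads

# Illustrative per-head budgets, keyed by head id.
budgets = {0: 95, 1: 82, 2: 55, 3: 30, 4: 18, 5: 12}
placement, loads = plan_placement(budgets, num_gpus=2)
print(placement)
print(loads)
```

Nothing in this plan depends on the incoming requests, so it can run before serving starts and leave no scheduling work for the per-step hot path.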

Here is where it earns its keep, with illustrative numbers. Take a layer with 6 attention heads. A uniform budget sizes every head for the most demanding case, say 100 units of KV each, so the layer holds 6 × 100 = 600 units per request. Now measure what each head actually keeps: a broad head needs ~95, the next ~82, a mixed head ~55, and the three local/sink heads only ~30, ~18, and ~12. Those sum to 292 units, so the uniform budget spends 600 to do the work of 292, leaving 308 units, roughly 51% of the KV cache, as padding. Reclaim that padding and about twice as many requests' caches (600 / 292 ≈ 2.05×) fit in the same HBM, which is the kind of headroom behind the reported up to 2.6× throughput.
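The arithmetic in this example is easy to check directly. The short script below reproduces it with the same illustrative numbers; they are not measurements from the paper, and the variable names are chosen only for this sketch.

```python
# Illustrative numbers from the worked example, not measured values.
per_head_need = [95, 82, 55, 30, 18, 12]   # KV units each head actually keeps
uniform_budget = 100                        # every head sized to the greediest case

uniform_total = uniform_budget * len(per_head_need)     # total under a uniform budget
per_head_total = sum(per_head_need)                     # total under per-head budgets
padding_fraction = 1 - per_head_total / uniform_total   # share of KV that is padding
capacity_gain = uniform_total / per_head_total          # how many more caches fit

print(f"uniform: {uniform_total} units, per-head: {per_head_total} units")
print(f"padding: {padding_fraction:.1%}, capacity gain: {capacity_gain:.2f}x")
```

Changing the list of per-head needs shows how quickly the gain shrinks when most heads are broad, which is why the savings are setup-dependent.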

Aspect	Uniform per-head budget	Tangram (per-head)
Cache size per head	same for all, sized to the greediest head	deterministic, sized to each head's retention pattern
Wasted KV on light heads	large: the full budget minus what the head keeps	small: about half the uniform KV is reclaimed overall (illustrative, setup-dependent)
Memory management	one shared layout, dynamically scheduled	head groups behind independent vectorized page tables
GPU memory planning	at runtime, per step	ahead-of-time from static budget profiles
Multi-turn throughput	baseline	up to 2.6× (Tangram paper)
Model accuracy	—	fully preserved (Tangram paper)

The catch is that a per-head budget is only safe if you can tell, ahead of time, how much each head truly needs — guess too small and you evict KV a broad head still wants, hurting quality. Tangram's claim is that a head's retention pattern is inherent and stable enough to fix the budget deterministically, which is what lets it reclaim the memory and keep accuracy. That makes it complementary to the other ways of shrinking the cache: quantizing each KV entry to fewer bits, or pruning which tokens to keep, attack a different axis than how big each head's budget should be in the first place.
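One way to picture "how much each head truly needs" is to ask how many past tokens it takes to cover most of a head's attention mass. The sketch below does that for two toy attention distributions. The coverage threshold, the function name `tokens_for_coverage`, and the heuristic itself are assumptions for illustration; this is not Tangram's Deterministic Budget Allocation rule, which the article does not spell out.

```python
def tokens_for_coverage(attn_weights, coverage=0.95):
    """Smallest number of past tokens whose attention mass reaches `coverage`.

    Hypothetical profiling heuristic: keep the highest-weight tokens first.
    """
    total = sum(attn_weights)
    running = 0.0
    for kept, weight in enumerate(sorted(attn_weights, reverse=True), start=1):
        running += weight
        if running >= coverage * total:
            return kept
    return len(attn_weights)

broad_head = [0.1] * 10                       # attention spread over the whole history
sink_head = [0.7, 0.27, 0.01, 0.01, 0.01]     # almost all mass on two anchor tokens

print(tokens_for_coverage(broad_head))   # the broad head needs nearly everything
print(tokens_for_coverage(sink_head))    # the sink head needs very little
```

Setting the threshold too low is the failure mode the paragraph above warns about: the budget shrinks, but a broad head loses KV it still attends to.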

Related explainers

SP-KV — self-pruned KV cache — the orthogonal axis: deciding which tokens to drop from the cache, not how big each head's budget is
KVARN — Hadamard 2-bit KV cache — shrinking the cache by storing each KV entry in fewer bits rather than fewer entries
SGLang v0.5.12 — MLA token-speed — multi-head latent attention, another route to a smaller per-head KV footprint
