The news. On June 26, 2026, researchers posted InfoKV (arXiv:2606.26875), an entropy-aware KV-cache compression framework. Existing compressors lean on attention weights to decide which cached tokens to keep; InfoKV argues attention overlooks an information-theoretic signal, a token's predictive uncertainty, and introduces a metric it calls Forward Influence to measure how a compressed token shapes future text. Tested on Llama-3.1, Llama-3.2 and DeepSeek-R1, it reports beating attention-only compressors on long-context reasoning, in both long prefilling and decoding.

Picture a detective's evidence board with only so many pins. As a case grows, clues pile up and the board fills, so the detective has to unpin some to make room. The obvious policy is to keep the clues everyone keeps citing — the ones referenced again and again — and recycle the rest. But the clue that finally cracks the case is often the puzzling one: the odd detail nobody could place at the time, pinned in a corner, rarely cited. Unpin it to save space and the break never comes. That puzzling, rarely-cited clue is exactly what a token's attention score tends to undervalue — and exactly what InfoKV is built to keep.

Under the metaphor, the board is the KV cache: the keys and values every past token left behind so future attention can look back at them without recomputing. The cache grows with the context, so a long enough prompt can exhaust GPU memory and force the engine to evict tokens. That is the cache-compression problem.

| Factor | What it counts | Multiplier |
| --- | --- | --- |
| K and V | Two vectors stored per token (Key + Value) | × 2 |
| Layers | Each layer has its own cache (like 32 filing cabinets) | × 32 |
| Heads | Each attention head stores its own K/V pair | × 32 |
| Head size | Each K or V vector has 128 numbers (d_head) | × 128 |
| Bytes per number | FP16 = 2 bytes per number (half precision) | × 2 |

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes = 512 KiB
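
The multiplication above can be checked in a few lines of Python. This is a minimal sketch: the function name `kv_bytes_per_token` and its parameter names are illustrative rather than taken from any library, and the shape values are the Llama-2 7B figures from the breakdown.

```python
def kv_bytes_per_token(n_layers: int, n_heads: int, d_head: int,
                       bytes_per_number: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers.

    Hypothetical helper for illustration. The leading 2 accounts for
    storing both a Key and a Value vector per head per layer.
    """
    return 2 * n_layers * n_heads * d_head * bytes_per_number


# Llama-2 7B shape as used in the breakdown: 32 layers, 32 heads, d_head = 128, FP16.
per_token = kv_bytes_per_token(n_layers=32, n_heads=32, d_head=128)
print(per_token)         # bytes per token
print(per_token / 1024)  # the same figure in KiB
```

Because the cost is per token, total cache size scales linearly with context length, which is why long prompts are the ones that hit the memory ceiling.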

The standard fix is attention-only: keep the tokens that have drawn the highest attention so far, evict the low-attention ones. It feels safe, because high attention reads like importance. InfoKV's contribution is to show that this misses a whole class of useful tokens. The tokens the model was most uncertain about — the ones whose next-token distribution was widest — turn out to steer far-off text the most, even though attention barely looked at them. As the paper puts it, "tokens associated with high predictive uncertainty exhibit substantially stronger influence on distant future contexts" — while attention-selected tokens mainly influence nearby context. So InfoKV scores each token by Forward Influence, blending three signals: its predictive uncertainty, how its meaning shifts across layers, and its attention score — then keeps the high-influence tokens and evicts the rest. The unsure clue stays pinned.
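
The paper's exact Forward Influence formula is not reproduced here. The sketch below only illustrates the general idea of blending the three signals and keeping the top scorers. The equal weights, the min-max normalization, every function name, and the toy numbers are assumptions made up for illustration, not InfoKV's actual method.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def minmax(values):
    """Rescale values to [0, 1] so the three signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def blended_scores(next_token_probs, layer_shift, attention,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed blend: weighted sum of uncertainty, layer shift, attention."""
    w_u, w_s, w_a = weights
    u = minmax([entropy(p) for p in next_token_probs])
    s = minmax(layer_shift)
    a = minmax(attention)
    return [w_u * ui + w_s * si + w_a * ai for ui, si, ai in zip(u, s, a)]


def keep_top(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data (invented): token 1 has a flat next-token distribution
# (high uncertainty) but has drawn almost no attention.
probs = [[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3],
         [0.90, 0.05, 0.05], [0.97, 0.02, 0.01]]
layer_shift = [0.1, 0.8, 0.2, 0.1]       # hypothetical per-token shift across layers
attention = [0.50, 0.05, 0.35, 0.10]     # hypothetical accumulated attention

attention_only = keep_top(attention, budget=2)
blended = keep_top(blended_scores(probs, layer_shift, attention), budget=2)
print(attention_only, blended)
```

In this toy setup, attention-only selection keeps tokens 0 and 2 and evicts the uncertain token 1, while the blended score keeps tokens 1 and 2. How the real method weighs and combines the signals is a question for the paper itself.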

Where it earns its keep is the long-context regime. Suppose the cache can hold only 2,000 tokens of a 32,000-token context (illustrative — the paper reports no fixed budget). An attention-only policy spends those 2,000 pins on the most-cited tokens and evicts a low-attention token from early in the prompt — the one detail a question 20,000 tokens later actually depends on. InfoKV scores that same token's Forward Influence as high (low attention, but high uncertainty), so it survives the cull. Same 2,000-token budget, but the surviving tokens are the ones that genuinely steer what comes next — which is the paper's account of why InfoKV holds long-context recall better than attention-only eviction does.

Where InfoKV sits next to other KV-cache levers

| Lever | Signal it uses | What it changes | Where it earns its keep |
| --- | --- | --- | --- |
| Attention-only eviction (H2O / SnapKV-style) | attention weight | which pairs are kept | cheap importance estimate, but undervalues low-attention-yet-informative tokens |
| KV quantization | none (applies to all pairs) | bytes per pair (fewer bits) | shrinks every kept pair; orthogonal to which pairs you keep |
| GQA | none (architectural) | K/V heads shared across query heads | a fixed-ratio cut baked into the model |
| InfoKV (this paper) | uncertainty + layer-evolution + attention | which pairs are kept | long-context reasoning, where the tokens attention drops are the ones that matter far ahead |

The compositional point: InfoKV is a selection rule, so it lives on the same axis as attention-only eviction and is orthogonal to quantization — you could quantize the pairs InfoKV decides to keep. The paper itself focuses on the selection result and reports beating attention-only methods on long-context reasoning across Llama-3.1, Llama-3.2 and DeepSeek-R1; it does not benchmark the stacked combination. The headline is not a smaller cache; it is a smarter choice of what goes in it — keep the tokens that shape the future, not just the ones attention happened to read.
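
Because selection and quantization act on different factors of the cache size, their savings multiply. The sketch below is back-of-envelope arithmetic only, not a result from the paper. It reuses the Llama-2 7B shape and the illustrative 2,000-of-32,000 token budget from earlier, assumes 4-bit quantization purely as an example, and ignores quantization metadata such as scales and zero points. `cache_bytes` is a hypothetical helper.

```python
def cache_bytes(tokens_kept: int, n_layers: int, n_heads: int,
                d_head: int, bits_per_number: int) -> float:
    """Approximate KV-cache size in bytes (hypothetical helper).

    Eviction changes `tokens_kept`; quantization changes `bits_per_number`.
    """
    numbers_per_token = 2 * n_layers * n_heads * d_head  # K and V
    return tokens_kept * numbers_per_token * bits_per_number / 8


shape = dict(n_layers=32, n_heads=32, d_head=128)  # Llama-2 7B shape

full = cache_bytes(32_000, bits_per_number=16, **shape)     # no compression
evicted = cache_bytes(2_000, bits_per_number=16, **shape)   # selection only
stacked = cache_bytes(2_000, bits_per_number=4, **shape)    # selection + assumed 4-bit

print(full / evicted)  # reduction from eviction alone
print(full / stacked)  # reduction from both levers together
```

Under these assumptions, eviction alone gives a 16× reduction and adding 4-bit quantization brings it to 64×. Whether accuracy survives the stacked combination is exactly what the paper does not measure.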
