The news. On June 26, 2026, researchers posted InfoKV (arXiv:2606.26875), an entropy-aware KV-cache compression framework. Existing compressors lean on attention weights to decide which cached tokens to keep; InfoKV argues that attention overlooks an information-theoretic signal, a token's predictive uncertainty, and introduces a metric it calls Forward Influence to measure how a compressed token shapes future text. Tested on Llama-3.1, Llama-3.2 and DeepSeek-R1, it reports beating attention-only compressors on long-context reasoning, in both long prefilling and decoding.
Picture a detective's evidence board with only so many pins. As a case grows, clues pile up and the board fills, so the detective has to unpin some to make room. The obvious policy is to keep the clues everyone keeps citing — the ones referenced again and again — and recycle the rest. But the clue that finally cracks the case is often the puzzling one: the odd detail nobody could place at the time, pinned in a corner, rarely cited. Unpin it to save space and the break never comes. That puzzling, rarely-cited clue is exactly what a token's attention score tends to undervalue — and exactly what InfoKV is built to keep.
Under the metaphor, the board is the KV cache: the keys and values every past token left behind so future attention can look back at them without recomputing. The cache grows linearly with the context, so a long enough prompt exceeds the GPU memory available for it and the engine has to evict tokens. That is the cache-compression problem.
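To make the growth concrete, here is a back-of-the-envelope sketch of cache size. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble a Llama-3.1-8B-class model; it is not taken from the InfoKV paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor of 2: one key vector and one value vector per layer, head and token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Assumed configuration (Llama-3.1-8B-like, fp16); illustrative only.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
full = kv_cache_bytes(32_000, n_layers=32, n_kv_heads=8, head_dim=128)

print(per_token)               # 131072 bytes, i.e. 128 KiB per token
print(round(full / 2**30, 2))  # 3.91 (GiB for a 32,000-token context)
```

Every term except `seq_len` is fixed by the model, which is why the only runtime levers are how many tokens stay cached and how many bytes each one costs.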
The standard fix is attention-only: keep the tokens that have drawn the highest attention so far, evict the low-attention ones. It feels safe, because high attention reads like importance. InfoKV's contribution is to show that this misses a whole class of useful tokens. The tokens the model was most uncertain about — the ones whose next-token distribution was widest — turn out to steer far-off text the most, even though attention barely looked at them. As the paper puts it, "tokens associated with high predictive uncertainty exhibit substantially stronger influence on distant future contexts" — while attention-selected tokens mainly influence nearby context. So InfoKV scores each token by Forward Influence, blending three signals: its predictive uncertainty, how its meaning shifts across layers, and its attention score — then keeps the high-influence tokens and evicts the rest. The unsure clue stays pinned.
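A minimal sketch of what blending the three signals could look like. The article does not give the paper's Forward Influence formula, so everything here is an assumption: `forward_influence_sketch`, the min-max normalization, and the equal weights are hypothetical stand-ins, not InfoKV's method. Only the entropy computation is standard (Shannon entropy of the softmax over next-token logits).

```python
import math


def predictive_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits): wider distribution, higher value."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalize(xs):
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def forward_influence_sketch(entropy, layer_shift, attention,
                             weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical linear blend of the three per-token signals (NOT the paper's formula)."""
    w_e, w_s, w_a = weights
    e, s, a = normalize(entropy), normalize(layer_shift), normalize(attention)
    return [w_e * ei + w_s * si + w_a * ai for ei, si, ai in zip(e, s, a)]


# Token 0: confident prediction, heavily attended.
# Token 1: uncertain prediction, barely attended.
confident = predictive_entropy([8.0, 0.0, 0.0, 0.0])
unsure = predictive_entropy([1.0, 1.0, 1.0, 1.0])
scores = forward_influence_sketch(
    entropy=[confident, unsure],
    layer_shift=[0.2, 0.6],   # made-up layer-evolution values
    attention=[0.9, 0.1],     # made-up attention values
)
print(scores[1] > scores[0])  # True: the uncertain token outranks the attended one
```

With attention alone, token 0 would win; once uncertainty and layer shift enter the score, token 1 does.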
Where it earns its keep is the long-context regime. Suppose the cache can hold only 2,000 tokens of a 32,000-token context (illustrative — the paper reports no fixed budget). An attention-only policy spends those 2,000 pins on the most-cited tokens and evicts a low-attention token from early in the prompt — the one detail a question 20,000 tokens later actually depends on. InfoKV scores that same token's Forward Influence as high (low attention, but high uncertainty), so it survives the cull. Same 2,000-token budget, but the surviving tokens are the ones that genuinely steer what comes next — which is the paper's account of why InfoKV holds long-context recall better than attention-only eviction does.
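The scenario above can be shrunk to a toy: six cached tokens, a budget of three. All numbers are invented for illustration, and the 50/50 blend of attention and uncertainty is an assumed stand-in for the paper's score, which also uses a layer-evolution signal omitted here. `keep_top` is a hypothetical helper.

```python
def keep_top(scores, budget):
    """Return the positions (in order) of the `budget` highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Token 1 is the early detail: almost no attention, very high uncertainty.
attention = [0.30, 0.02, 0.25, 0.20, 0.05, 0.18]
uncertainty = [0.10, 0.95, 0.20, 0.15, 0.10, 0.30]

# Assumed blend: scale each signal by its maximum, then average the two.
blended = [
    0.5 * a / max(attention) + 0.5 * u / max(uncertainty)
    for a, u in zip(attention, uncertainty)
]

print(keep_top(attention, 3))  # [0, 2, 3]  token 1 is evicted
print(keep_top(blended, 3))    # [0, 1, 2]  token 1 survives
```

The budget is identical in both calls; only the ranking changes which tokens fill it.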
Where InfoKV sits next to other KV-cache levers
| Lever | Signal it uses | What it changes | Where it earns its keep |
|---|---|---|---|
| Attention-only eviction (H2O / SnapKV-style) | attention weight | which pairs are kept | cheap importance estimate, but undervalues low-attention-yet-informative tokens |
| KV quantization | none — applies to all pairs | bytes per pair (fewer bits) | shrinks every kept pair; orthogonal to which pairs you keep |
| GQA | none — architectural | K/V heads shared across query heads | a fixed-ratio cut baked into the model |
| InfoKV (this paper) | uncertainty + layer-evolution + attention | which pairs are kept | long-context reasoning, where the tokens attention drops are the ones that matter far ahead |
The compositional point: InfoKV is a selection rule, so it lives on the same axis as attention-only eviction and is orthogonal to quantization — you could quantize the pairs InfoKV decides to keep. The paper itself focuses on the selection result and reports beating attention-only methods on long-context reasoning across Llama-3.1, Llama-3.2 and DeepSeek-R1; it does not benchmark the stacked combination. The headline is not a smaller cache; it is a smarter choice of what goes in it — keep the tokens that shape the future, not just the ones attention happened to read.
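Because selection changes how many pairs are stored and quantization changes how many bits each one takes, the two savings multiply. A rough sketch of that arithmetic, reusing the illustrative 2,000-of-32,000 budget and assuming a 16-bit baseline and a 4-bit quantizer; it ignores quantization metadata such as scales and zero-points, and `compressed_fraction` is a hypothetical helper, not something the paper reports.

```python
def compressed_fraction(kept_tokens, total_tokens, bits_kept, bits_full=16):
    """Fraction of the original cache size left after eviction and quantization."""
    return (kept_tokens / total_tokens) * (bits_kept / bits_full)


print(compressed_fraction(2_000, 32_000, 16))  # 0.0625    selection only
print(compressed_fraction(32_000, 32_000, 4))  # 0.25      quantization only
print(compressed_fraction(2_000, 32_000, 4))   # 0.015625  both stacked
```

This only counts bytes. Whether accuracy survives the stacked setting is the part the paper does not benchmark.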
Related explainers
- SP-KV — Utility predictor for the KV cache — the same "which pairs to keep" question, answered with a learned utility head instead of an information-theoretic score
- WorldKV — Evict-and-reinsert KV memory — eviction with a second chance: send chunks to storage and fetch them back on revisit
- AnchorKV — safety-aware KV compression — another argument that attention-only importance drops the wrong tokens, here the ones that hold a refusal