What is InfoKV's entropy-aware KV-cache compression?

InfoKV is a KV-cache compression framework, posted June 26, 2026, that decides which cached tokens to keep using information-theoretic signals rather than attention weights alone. It scores each token by a metric it calls Forward Influence — combining the token's predictive uncertainty (the entropy of its next-token distribution), how its representation evolves across layers, and its attention score — then keeps the high-influence tokens and evicts the rest. The paper reports it beats attention-only compressors on long-context reasoning across Llama-3.1, Llama-3.2 and DeepSeek-R1.

Why isn't attention enough to decide what to keep in the KV cache?

Attention scores capture which tokens the model is looking at, which reads like importance, but InfoKV's analysis finds attention-selected tokens mostly influence nearby context. The tokens the model was most uncertain about — wide, high-entropy next-token distributions — turn out to shape distant future text more strongly, yet attention barely weights them. An attention-only compressor can therefore evict the very tokens long-context recall may depend on, which is the failure InfoKV's Forward Influence metric is designed to address.

How does InfoKV relate to KV quantization?

They operate on different axes and compose. Quantization (FP8, INT4, 2-bit codebooks) shrinks every kept KV pair by using fewer bits per value; InfoKV instead chooses which pairs are worth keeping at all. Because one changes pair size and the other changes pair retention, you could in principle quantize the pairs InfoKV selects for a combined saving — though the paper focuses on the selection result and does not benchmark the stack.

InfoKV: entropy-aware KV-cache compression keeps long-context recall — Forward Influence

Jargon

KV cache: The cached keys and values from every prior token, so attention does not recompute them each decode step. Its size grows with sequence length, which is why long context needs compression.
Attention-only KV compression: The prior family of methods (H2O, SnapKV and similar) that score a token's importance by its attention weight and evict the low-attention ones. InfoKV's headline claim is that attention alone misses tokens that matter later.
Predictive uncertainty: How unsure the model was when it processed a token, measured from the spread (entropy) of its next-token distribution. InfoKV treats high uncertainty as a signal that a token is informative, not noise.
Forward Influence: InfoKV's metric for how much a compressed token affects future contexts. The paper's key observation: high-attention tokens influence mostly nearby text, while high-uncertainty tokens influence distant text more.
Layer-wise representation evolution: How much a token's hidden vector changes as it passes up through the layers. InfoKV folds this into its score alongside uncertainty and attention.
Long-context recall: Whether the model can still use information from far earlier in a long prompt. It is exactly what aggressive cache compression tends to break, and the benchmark axis InfoKV targets.
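To make the "predictive uncertainty" entry above concrete, here is a minimal sketch of how the entropy of a next-token distribution can be computed. The four-word vocabulary and the logit values are made up for illustration; they do not come from the paper or from any real model, and the paper may normalize or scale entropy differently.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    """Shannon entropy H(p) = -sum(p_i * ln p_i), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits over a toy 4-word vocabulary.
confident = softmax([8.0, 0.0, 0.0, 0.0])  # one clear winner: narrow distribution
uncertain = softmax([1.0, 1.0, 1.0, 1.0])  # no winner: wide, flat distribution

h_confident = entropy_nats(confident)
h_uncertain = entropy_nats(uncertain)  # equals ln(4), the maximum for 4 outcomes

print(f"confident token entropy: {h_confident:.4f} nats")
print(f"uncertain token entropy: {h_uncertain:.4f} nats")
```

The flat distribution scores the maximum entropy for its vocabulary size, while the peaked one scores close to zero. In InfoKV's framing, the first kind of token is the one worth a second look before eviction.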

The news. On June 26, 2026, researchers posted InfoKV (arXiv:2606.26875), an entropy-aware KV-cache compression framework. Existing compressors lean on attention weights to decide which cached tokens to keep; InfoKV argues attention overlooks an information-theoretic signal, a token's predictive uncertainty, and introduces a metric it calls Forward Influence to measure how a compressed token shapes future text. Tested on Llama-3.1, Llama-3.2 and DeepSeek-R1, it reports beating attention-only compressors on long-context reasoning, in both long prefilling and decoding.

Picture a detective's evidence board with only so many pins. As a case grows, clues pile up and the board fills, so the detective has to unpin some to make room. The obvious policy is to keep the clues that get referenced again and again, and recycle the rest. But the clue that finally cracks the case is often the puzzling one: the odd detail nobody could place at the time, pinned in a corner, rarely cited. Unpin it to save space and the break never comes. That puzzling, rarely cited clue is exactly what a token's attention score tends to undervalue, and exactly what InfoKV is built to keep.

Under the metaphor, the board is the KV cache: the keys and values every past token left behind so future attention can look back at them without recomputing. The cache grows with the context, so a long prompt overflows GPU memory and the engine has to evict tokens — the cache-compression problem.
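A back-of-envelope calculation shows why the cache overflows. The sketch below uses the standard sizing rule (one key and one value vector per token, per layer, per KV head). The model shape is an assumed example resembling an 8B-class model with grouped-query attention and FP16 values; it is not taken from the paper, and it covers a single sequence only.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each token stores one key vector AND one value vector.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed, illustrative model shape (not from the paper).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **cfg)
full_context = kv_cache_bytes(32_000, **cfg)
compressed = kv_cache_bytes(2_000, **cfg)

print(f"per token:     {per_token / 1024:.0f} KiB")
print(f"32,000 tokens: {full_context / 2**30:.2f} GiB")
print(f" 2,000 tokens: {compressed / 2**20:.0f} MiB")
```

Under these assumptions every cached token costs 128 KiB, so the cost scales linearly with context length and again with batch size. That linear growth is what forces an eviction policy in the first place.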

The standard fix is attention-only: keep the tokens that have drawn the highest attention so far, evict the low-attention ones. It feels safe, because high attention reads like importance. InfoKV's contribution is to show that this misses a whole class of useful tokens. The tokens the model was most uncertain about, the ones whose next-token distribution was widest, turn out to steer far-off text the most, even though attention barely looked at them. As the paper puts it, "tokens associated with high predictive uncertainty exhibit substantially stronger influence on distant future contexts", while attention-selected tokens mainly influence nearby context. So InfoKV scores each token by Forward Influence, blending three signals: its predictive uncertainty, how its representation shifts across layers, and its attention score. It then keeps the high-influence tokens and evicts the rest. The unsure clue stays pinned.
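The three-signal blend described above can be sketched as follows. This explainer does not give the paper's actual Forward Influence formula, so the min-max normalization, the equal weights, and the function name `forward_influence_sketch` are all assumptions made for illustration. The per-token numbers are invented.

```python
def min_max(values):
    """Rescale a list to [0, 1] so the three signals are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def forward_influence_sketch(entropy, layer_shift, attention,
                             weights=(1.0, 1.0, 1.0)):
    """ASSUMED blend: weighted sum of normalized signals (not the paper's formula)."""
    w_e, w_l, w_a = weights
    e, l, a = min_max(entropy), min_max(layer_shift), min_max(attention)
    return [w_e * ei + w_l * li + w_a * ai for ei, li, ai in zip(e, l, a)]

# Invented signals for five cached tokens; token 1 is the "puzzling clue".
entropy     = [0.1, 2.5, 0.2, 0.3, 0.1]       # predictive uncertainty
layer_shift = [0.2, 0.9, 0.3, 0.4, 0.2]       # representation change across layers
attention   = [0.40, 0.02, 0.30, 0.20, 0.08]  # accumulated attention weight

scores = forward_influence_sketch(entropy, layer_shift, attention)

lowest_attention = min(range(5), key=lambda i: attention[i])
highest_blended = max(range(5), key=lambda i: scores[i])
print("lowest-attention token:", lowest_attention)
print("highest blended score: ", highest_blended)
```

In this toy, the token with the least attention is also the one with the highest blended score, which is the reversal the paper's argument turns on. How the real metric weights and combines the signals is something to check against the paper itself.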

Where it earns its keep is the long-context regime. Suppose the cache can hold only 2,000 tokens of a 32,000-token context (illustrative — the paper reports no fixed budget). An attention-only policy spends those 2,000 pins on the most-cited tokens and evicts a low-attention token from early in the prompt — the one detail a question 20,000 tokens later actually depends on. InfoKV scores that same token's Forward Influence as high (low attention, but high uncertainty), so it survives the cull. Same 2,000-token budget, but the surviving tokens are the ones that genuinely steer what comes next — which is the paper's account of why InfoKV holds long-context recall better than attention-only eviction does.
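The budget scenario above can be replayed as a toy simulation at the same scale. Everything here is synthetic and assumed: attention is modeled as favoring recent tokens, entropy is flat except for one early "needle" token, and the blended score uses only two of the three signals (layer-wise evolution is left out to keep the example small). It illustrates the eviction logic, not the paper's results.

```python
import heapq

CONTEXT_LEN = 32_000  # tokens in the prompt (illustrative, as in the text)
BUDGET = 2_000        # tokens the cache can hold (illustrative)
NEEDLE = 500          # hypothetical early token a later question depends on

# Synthetic attention: the most recent tokens have drawn the most attention.
attention = [1.0 / (CONTEXT_LEN - i) for i in range(CONTEXT_LEN)]

# Synthetic uncertainty: low everywhere except the needle.
entropy = [0.1] * CONTEXT_LEN
entropy[NEEDLE] = 3.0

def keep(scores, budget):
    """Return the set of token indices with the `budget` highest scores."""
    return set(heapq.nlargest(budget, range(len(scores)), key=scores.__getitem__))

attention_only = keep(attention, BUDGET)

# ASSUMED blend: each signal divided by its maximum, then summed.
max_att, max_ent = max(attention), max(entropy)
blended = [a / max_att + e / max_ent for a, e in zip(attention, entropy)]
uncertainty_aware = keep(blended, BUDGET)

print("needle kept by attention-only:   ", NEEDLE in attention_only)
print("needle kept by uncertainty-aware:", NEEDLE in uncertainty_aware)
```

Both policies spend exactly the same 2,000 slots. The only difference is which tokens fill them, and in this construction only the uncertainty-aware policy retains the early needle.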

Where InfoKV sits next to other KV-cache levers

Lever	Signal it uses	What it changes	Where it earns its keep
Attention-only eviction (H2O / SnapKV-style)	attention weight	which pairs are kept	cheap importance estimate, but undervalues low-attention-yet-informative tokens
KV quantization	none — applies to all pairs	bytes per pair (fewer bits)	shrinks every kept pair; orthogonal to which pairs you keep
GQA	none — architectural	K/V heads shared across query heads	a fixed-ratio cut baked into the model
InfoKV (this paper)	uncertainty + layer-evolution + attention	which pairs are kept	long-context reasoning, where the tokens attention drops are the ones that matter far ahead

The compositional point: InfoKV is a selection rule, so it lives on the same axis as attention-only eviction and is orthogonal to quantization — you could quantize the pairs InfoKV decides to keep. The paper itself focuses on the selection result and reports beating attention-only methods on long-context reasoning across Llama-3.1, Llama-3.2 and DeepSeek-R1; it does not benchmark the stacked combination. The headline is not a smaller cache; it is a smarter choice of what goes in it — keep the tokens that shape the future, not just the ones attention happened to read.
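Because selection and quantization act on different axes, their memory savings multiply. The arithmetic below is a sketch under stated assumptions: the 2,000-of-32,000 retention figure is the illustrative budget used earlier, FP16 is assumed as the baseline, and quantization metadata (scales, zero points) is ignored. The paper does not benchmark this stack, so the numbers describe memory only and say nothing about accuracy.

```python
def cache_fraction(kept_tokens, total_tokens, bits_after, bits_before=16):
    """Fraction of the original cache size left after selection and quantization."""
    retention = kept_tokens / total_tokens  # axis 1: which pairs are kept
    bit_ratio = bits_after / bits_before    # axis 2: bits per stored value
    return retention * bit_ratio

selection_only = cache_fraction(2_000, 32_000, bits_after=16)
quantization_only = cache_fraction(32_000, 32_000, bits_after=4)
stacked = cache_fraction(2_000, 32_000, bits_after=4)

print(f"selection only:    {selection_only:.4%} of original size")
print(f"INT4 only:         {quantization_only:.4%} of original size")
print(f"selection + INT4:  {stacked:.4%} of original size")
```

The stacked figure is simply the product of the two individual fractions, which is what "orthogonal" means in practice here.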


Related explainers

SP-KV — Utility predictor for the KV cache — the same "which pairs to keep" question, answered with a learned utility head instead of an information-theoretic score
WorldKV — Evict-and-reinsert KV memory — eviction with a second chance: send chunks to storage and fetch them back on revisit
AnchorKV — safety-aware KV compression — another argument that attention-only importance drops the wrong tokens, here the ones that hold a refusal
