The news. On June 9, 2026, a Tencent team (Wang et al., arXiv 2606.09079) released FlashMemory-DeepSeek-V4, introducing Lookahead Sparse Attention (LSA). Instead of loading the entire KV cache during long-context decode, a Neural Memory Indexer predicts which context chunks matter and keeps only their KV entries. On DeepSeek-V4 the authors report a physical KV-cache footprint of 13.5% of the full-context baseline and a reduction of over 90% at 500K context, while improving accuracy by 0.6% on average.
Picture the metaphor for a moment. The model has read an enormous document and filed every page into an archive — that archive is the KV cache. To write the next word, the naive approach hauls the entire archive to the desk, every single time. LSA hires an assistant who has learned your habits: before each step they glance at what you're working on, predict which few folders you'll actually open, and set out only those. The desk stays nearly empty, and in the paper's reported evaluations the quality holds. That "predict, then fetch only the few" move is the whole idea.
Why does the archive, not the arithmetic, become the wall? Every token the model reads leaves a Key and a Value in the KV cache, and the cache is multiplied out across layers and heads, so it grows linearly with context length. In a standard attention layout, each token costs roughly 2 (one Key, one Value) × layers × KV heads × head dimension × bytes per element.
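The size rule above can be sketched in a few lines. This is a minimal illustration that assumes a standard, uncompressed Key/Value layout; the function name `kv_cache_bytes` and every number in `cfg` are hypothetical and are not DeepSeek-V4's real configuration. Models with compressed or shared KV layouts will come out smaller.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold Keys and Values for `tokens` tokens.

    Assumes a plain layout: one Key and one Value vector of size
    `head_dim` per KV head, per layer, per token.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens


# Hypothetical configuration, chosen only so the numbers are easy to follow.
cfg = dict(layers=40, kv_heads=4, head_dim=128, bytes_per_elem=2)

for tokens in (4_000, 500_000):
    gb = kv_cache_bytes(tokens, **cfg) / 1e9  # decimal gigabytes
    print(f"{tokens:>7} tokens -> {gb:.2f} GB")
# 4000 tokens   -> 0.33 GB
# 500000 tokens -> 40.96 GB
```

The cost per token is constant, so the total is a straight line in context length: the same model that needs a third of a gigabyte at 4K tokens needs about 41 GB at 500K under these assumed dimensions.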
At a few thousand tokens nobody notices; at 500K the cache becomes the dominant draw on GPU serving memory and caps how much the device can hold. LSA's Neural Memory Indexer attacks that directly: it scores chunks of the cached past, flags the ones the current token is predicted to need, and keeps only their KV entries — a selective gather of the query-critical chunks rather than a full sweep of the cache.
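The score, flag, and keep loop described above can be sketched as follows. This is a toy stand-in, not the paper's indexer: the real Neural Memory Indexer is a learned model, whereas here a plain dot product against a per-chunk summary vector plays that role. The names `select_chunks`, `gather_kv`, and `keep_fraction` are hypothetical.

```python
def select_chunks(query, chunk_summaries, keep_fraction):
    """Return the indices of the highest-scoring chunks, in original order.

    Assumption: each chunk is represented by one summary vector, and
    relevance is a dot product with the current query vector.
    """
    scores = [
        sum(q * s for q, s in zip(query, summary))
        for summary in chunk_summaries
    ]
    n_keep = max(1, round(keep_fraction * len(chunk_summaries)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def gather_kv(kv_chunks, keep):
    """Keep only the KV entries of the selected chunks."""
    return [kv_chunks[i] for i in keep]


# Four cached chunks, each with a 2-dimensional summary (illustrative values).
summaries = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
kv_chunks = ["kv0", "kv1", "kv2", "kv3"]  # placeholders for real K/V tensors

keep = select_chunks([1.0, 0.0], summaries, keep_fraction=0.5)
print(keep)                        # [0, 2]
print(gather_kv(kv_chunks, keep))  # ['kv0', 'kv2']
```

The point of the sketch is the shape of the operation: scoring touches only small per-chunk summaries, and the full Key and Value entries are held for the selected chunks alone.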
The genuinely new trick is how the indexer is trained. A straightforward way to learn which chunks matter would be to run the full model end-to-end, which means holding the trillion-scale backbone in GPU memory the whole time. FlashMemory trains the indexer backbone-free and decoupled — the base model never has to be loaded during indexer training — so the part that earns the savings is also cheap to build.
This is a different lever from the block-sparse schemes you may have seen. MiniMax's MSA gathers a few KV blocks per query to cut attention compute, but it still keeps the whole cache resident. LSA aims one layer down: it avoids holding most of the cache at all, so the win is the physical memory footprint. Same family — select instead of sweep — but one saves FLOPs and the other saves gigabytes.
Where the 13.5% comes from
Hold the setup fixed and walk it. Say the full-context KV cache at 500K tokens weighs 40 GB (illustrative), already a heavy load for a single GPU. LSA's indexer keeps only the predicted-relevant chunks; at the paper's reported ratio it preserves 13.5% of that cache, so the resident footprint drops to about 5.4 GB, a saving of about 34.6 GB, or 86.5%. The authors' separate "over 90% smaller at 500K" figure would put the resident cache under 4 GB in this illustrative setup. The surprising part is the accuracy line: selection usually costs a little quality, but FlashMemory reports +0.6% on average, because the indexer drops mostly the chunks a token wasn't going to use anyway.
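The walk-through arithmetic can be checked directly. The 40 GB baseline is the illustrative figure from the text, the 13.5% ratio is the paper's reported number, and the helper name `footprint` is hypothetical.

```python
def footprint(full_gb, kept_fraction):
    """Return (resident GB, saved GB, fractional reduction)."""
    resident = full_gb * kept_fraction
    saved = full_gb - resident
    return resident, saved, saved / full_gb


resident, saved, reduction = footprint(full_gb=40.0, kept_fraction=0.135)
print(f"resident: {resident:.1f} GB")   # resident: 5.4 GB
print(f"saved:    {saved:.1f} GB")      # saved:    34.6 GB
print(f"smaller:  {reduction:.1%}")     # smaller:  86.5%

# For comparison, a reduction of 90% on the same baseline
# would leave 40 * 0.10 = 4 GB resident.
```

Keeping 13.5% of the cache is an 86.5% reduction, so the 13.5% ratio and the "over 90% at 500K" claim are best read as two separate reported figures rather than one number stated twice.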
How the long-context attention options compare
| Approach | What stays in the KV cache | What it saves | Note |
|---|---|---|---|
| Full KV cache (dense) | every token's Key and Value, always | nothing — the baseline | exact, but the footprint balloons with length |
| Sliding-window | only a fixed recent window | memory, by forgetting far context | cheap; loses long-range recall |
| Block-sparse gather (DSA / MoBA / MSA) | the whole cache, but reads a few blocks per query | mostly attention compute | cache stays resident; saves FLOPs, not footprint |
| LSA (FlashMemory) | only indexer-predicted chunks | physical footprint → ~13.5% (Wang et al.; setup-dependent) | +0.6% accuracy; indexer trained backbone-free |
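The table's distinction between what stays resident and what gets read per decode step can be put into a toy accounting model. Everything here is an assumption made for illustration: the function names, the window size, and the block counts are hypothetical, and only the 13.5% ratio comes from the reported results. For the indexed case, reads are set equal to the resident set as an upper bound.

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    resident_tokens: int  # tokens whose K/V stay in GPU memory
    read_tokens: int      # tokens whose K/V are read per decode step


def dense(n):
    return Footprint(n, n)


def sliding_window(n, window):
    w = min(n, window)
    return Footprint(w, w)


def block_sparse(n, block, blocks_per_query):
    # Whole cache stays resident; only a few blocks are read per query.
    return Footprint(n, min(n, block * blocks_per_query))


def indexed_chunks(n, keep_fraction):
    kept = round(n * keep_fraction)
    return Footprint(kept, kept)


n = 500_000
rows = {
    "dense": dense(n),
    "sliding-window": sliding_window(n, window=4_096),
    "block-sparse": block_sparse(n, block=64, blocks_per_query=16),
    "indexed chunks": indexed_chunks(n, keep_fraction=0.135),
}
for name, fp in rows.items():
    print(f"{name:>15}: resident={fp.resident_tokens:>7} read={fp.read_tokens:>7}")
```

In this model the block-sparse row reads the least but holds all 500,000 tokens, while the indexed row holds 67,500: the first saves work per step, the second saves memory.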
One caveat worth keeping: the headline numbers — 13.5% footprint, over-90% reduction at 500K, +0.6% accuracy — are the authors' own results on DeepSeek-V4 at a single operating point, and selective-cache methods are setup-dependent: chunk size, how aggressively the indexer prunes, the sequence length, and the task all move them. The durable lesson is the lever (predict which chunks matter, hold only those); the exact percentage is a reported headline, not a guarantee at every length.
Related explainers
- MiniMax Sparse Attention (MSA) — the compute sibling: block-sparse gather that saves FLOPs while the cache stays resident
- SP-KV — a utility predictor for the KV cache — another learned selector, pruning the cache by predicted usefulness
- Tangram — per-head KV cache budgets — a different way to shrink long-context cache, by sizing each head's budget instead of indexing chunks