The news. On June 9, 2026, a Tencent team (Wang et al., arXiv 2606.09079) released FlashMemory-DeepSeek-V4, introducing Lookahead Sparse Attention (LSA). Instead of loading the entire KV cache during long-context decode, a Neural Memory Indexer predicts which context chunks matter and keeps only their KV entries. On DeepSeek-V4, the authors report that it cuts the physical KV-cache footprint to 13.5% of the full-context baseline, with a reported reduction of over 90% at 500K context, while improving accuracy by 0.6% on average.

Picture the metaphor for a moment. The model has read an enormous document and filed every page into an archive — that archive is the KV cache. To write the next word, the naive approach hauls the entire archive to the desk, every single time. LSA hires an assistant who has learned your habits: before each step they glance at what you're working on, predict which few folders you'll actually open, and set out only those. The desk stays nearly empty, and in the paper's reported evaluations the quality holds. That "predict, then fetch only the few" move is the whole idea.

Why does the archive — not the arithmetic — become the wall? Every token the model reads leaves a Key and a Value in the KV cache, and the cache is multiplied out across layers and heads, so it balloons with context length. The breakdown below shows where the gigabytes come from:

K and V (× 2): two vectors stored per token (Key + Value)
Layers (× 32): each layer has its own cache (like 32 filing cabinets)
Heads (× 32): each attention head stores its own K/V pair
Head size (× 128): each K or V vector has 128 numbers (d_head)
Bytes per number (× 2): FP16 = 2 bytes per number (half precision)

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
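The per-token product above is easy to check in a few lines. This is a minimal sketch: the function and parameter names are made up for illustration, and it assumes plain multi-head attention with one K/V pair per head, as in the Llama-2 7B breakdown. Models that share K/V across heads or compress the cache would come out smaller, and the 500K row is a hypothetical extrapolation, not a length Llama-2 7B supports.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a Key and a Value per head, per layer."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value


def kv_cache_bytes(n_tokens: int, **shape: int) -> int:
    """Total cache size: the per-token cost grows linearly with context length."""
    return n_tokens * kv_bytes_per_token(**shape)


# Shape taken from the breakdown above (Llama-2 7B, FP16).
LLAMA2_7B = dict(n_layers=32, n_kv_heads=32, d_head=128, bytes_per_value=2)

per_token = kv_bytes_per_token(**LLAMA2_7B)
gib_at_4k = kv_cache_bytes(4096, **LLAMA2_7B) / 2**30
gib_at_500k = kv_cache_bytes(500_000, **LLAMA2_7B) / 2**30  # hypothetical length

print(per_token)              # 524288
print(gib_at_4k)              # 2.0
print(round(gib_at_500k, 1))  # 244.1
```

The linear growth is the point: the same shape that costs 2 GiB at 4,096 tokens would cost roughly 244 GiB at 500K if nothing were dropped or compressed.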

At a few thousand tokens nobody notices; at 500K the cache becomes the dominant draw on GPU serving memory and caps how much the device can hold. LSA's Neural Memory Indexer attacks that directly: it scores chunks of the cached past, flags the ones the current token is predicted to need, and keeps only their KV entries — a selective gather of the query-critical chunks rather than a full sweep of the cache.
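The score, flag, and gather steps can be sketched as follows. This is an illustration of the general select-then-gather pattern only, not the paper's Neural Memory Indexer: the dot-product scoring against a per-chunk summary vector is an assumed stand-in for the learned indexer, and every name here (`chunk_scores`, `select_chunks`, `gather_kv`) is hypothetical.

```python
from typing import List, Sequence


def chunk_scores(query: Sequence[float],
                 chunk_summaries: Sequence[Sequence[float]]) -> List[float]:
    """Score each chunk by the dot product of the query with its summary vector.

    Assumption: a real indexer would be a trained model; a dot product is a stand-in.
    """
    return [sum(q * s for q, s in zip(query, summary)) for summary in chunk_summaries]


def select_chunks(scores: Sequence[float], keep: int) -> List[int]:
    """Indices of the `keep` highest-scoring chunks, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def gather_kv(kv_chunks: Sequence[object], selected: Sequence[int]) -> List[object]:
    """Keep only the KV entries of the selected chunks; the rest are not held."""
    return [kv_chunks[i] for i in selected]


# Toy data: four cached chunks, each with a 2-D summary and a placeholder KV payload.
summaries = [[0.1, 0.9], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1]]
kv_chunks = ["kv0", "kv1", "kv2", "kv3"]
query = [1.0, 0.0]

scores = chunk_scores(query, summaries)
selected = select_chunks(scores, keep=2)
resident = gather_kv(kv_chunks, selected)

print(selected)  # [1, 3]
print(resident)  # ['kv1', 'kv3']
```

Returning the selected indices in their original order keeps the gathered chunks in sequence, which matters if positional information is tied to chunk order.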

The genuinely new trick is how the indexer is trained. A straightforward way to learn which chunks matter would be to run the full model end-to-end, which means holding the trillion-scale backbone in GPU memory the whole time. FlashMemory trains the indexer backbone-free and decoupled — the base model never has to be loaded during indexer training — so the part that earns the savings is also cheap to build.

This is a different lever from the block-sparse schemes you may have seen. MiniMax's MSA gathers a few KV blocks per query to cut attention compute, but it still keeps the whole cache resident. LSA aims one layer down: it avoids holding most of the cache at all, so the win is the physical memory footprint. Same family — select instead of sweep — but one saves FLOPs and the other saves gigabytes.

Where the 13.5% comes from

Hold the setup fixed and walk it. Say the full-context KV cache at 500K tokens weighs 40 GB (illustrative), already a heavy load for a single GPU. LSA's indexer keeps only the predicted-relevant chunks; at the paper's reported ratio it preserves 13.5% of that cache, so the resident footprint drops to about 5.4 GB, a saving of about 34.6 GB, or 86.5%. The separate "over 90% smaller at 500K" figure the authors report would imply a retained share below 10% at that length, that is, under 4 GB in this example. The surprising part is the accuracy line: selection usually costs a little quality, but FlashMemory reports +0.6% on average, plausibly because the indexer drops mostly the chunks a token wasn't going to use anyway.
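The walk-through reduces to one multiplication. Here is a minimal sketch using the same illustrative 40 GB cache and the reported 13.5% keep ratio; the function name is hypothetical. Keeping 13.5% corresponds to an 86.5% reduction, so a reduction above 90% would need a keep ratio below 0.10.

```python
def resident_footprint_gb(full_cache_gb: float, keep_ratio: float) -> float:
    """Resident KV-cache size when only a fraction of the cache is kept."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be between 0 and 1")
    return full_cache_gb * keep_ratio


full = 40.0                      # illustrative full-context cache size, in GB
kept = resident_footprint_gb(full, keep_ratio=0.135)
saved = full - kept
reduction = saved / full

print(f"{kept:.1f} GB resident, {saved:.1f} GB saved ({reduction:.1%} reduction)")
# 5.4 GB resident, 34.6 GB saved (86.5% reduction)
```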

How the long-context attention options compare

| Approach | What stays in the KV cache | What it saves | Note |
| --- | --- | --- | --- |
| Full KV cache (dense) | every token's Key and Value, always | nothing (the baseline) | exact, but the footprint balloons with length |
| Sliding-window | only a fixed recent window | memory, by forgetting far context | cheap; loses long-range recall |
| Block-sparse gather (DSA / MoBA / MSA) | the whole cache, but reads a few blocks per query | mostly attention compute | cache stays resident; saves FLOPs, not footprint |
| LSA (FlashMemory) | only indexer-predicted chunks | physical footprint → ~13.5% (Wang et al.; setup-dependent) | +0.6% accuracy; indexer trained backbone-free |
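The distinction the table draws, between what stays resident and what is read per step, can be made concrete with a toy model. This is a rough sketch under loud assumptions: the 4,096-token window, the 13.5% read ratio for block-sparse gather, and the function itself are invented for comparison and do not describe any specific system's settings.

```python
from typing import Tuple


def cache_profile(approach: str, n_tokens: int, window: int = 4096,
                  keep_ratio: float = 0.135,
                  read_ratio: float = 0.135) -> Tuple[int, int]:
    """Return (tokens resident in memory, tokens read per decode step).

    All default values are illustrative assumptions, not measured settings.
    """
    if approach == "dense":
        return n_tokens, n_tokens                    # hold everything, read everything
    if approach == "sliding_window":
        recent = min(window, n_tokens)
        return recent, recent                        # hold and read only the recent window
    if approach == "block_sparse":
        return n_tokens, round(n_tokens * read_ratio)  # hold everything, read a few blocks
    if approach == "lsa":
        kept = round(n_tokens * keep_ratio)
        return kept, kept                            # hold only the predicted chunks
    raise ValueError(f"unknown approach: {approach}")


profiles = {name: cache_profile(name, 500_000)
            for name in ("dense", "sliding_window", "block_sparse", "lsa")}
for name, (resident, read) in profiles.items():
    print(f"{name:15s} resident={resident:>7d} read_per_step={read:>7d}")
```

With equal ratios, block-sparse gather and LSA read the same number of tokens per step; they differ only in the resident column, which is the memory-versus-compute split the table describes.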

One caveat worth keeping: the headline numbers — 13.5% footprint, over-90% reduction at 500K, +0.6% accuracy — are the authors' own results on DeepSeek-V4 at a single operating point, and selective-cache methods are setup-dependent: chunk size, how aggressively the indexer prunes, the sequence length, and the task all move them. The durable lesson is the lever (predict which chunks matter, hold only those); the exact percentage is a reported headline, not a guarantee at every length.
