DeepSeek V4-Pro and V4-Flash — long-context cost cut to a fraction

LLM
DeepSeek V4 paired-cost-cut animation — three rows comparing V3.2 baseline to V4-Pro and V4-Flash. FLOPs/token and KV cache shrink together, enabling roughly 10× more concurrent 1M-context users on V4-Pro and 14× more on V4-Flash.

The news. On April 24, 2026, DeepSeek released previews of V4-Pro and V4-Flash — long-context models built around dramatically cheaper per-token inference. At 1M tokens of context, V4-Pro reaches 27% of V3.2's per-token FLOPs and 10% of its KV cache, while V4-Flash pushes further to 10% FLOPs and 7% KV. Both numbers move together because they are two sides of one bill: the work to produce the next token, and the memory needed to remember everything before it. Source: Hugging Face model card →

Picture the librarian. Every time someone asks a new question, two things happen. The librarian thinks for a fixed amount of time — that's the feed-forward network, the per-layer computation that doesn't care how long the conversation has been. Then the librarian re-walks the entire shelf, one book at a time, weighing how much each prior page bears on the new question. That re-walk is attention, and at 1M context the shelf has roughly a million books on it. The walking dwarfs the thinking by ~125×. Meanwhile every prior question and answer has already left a book on the shelf — the KV cache — and that shelf occupies real memory, growing strictly linearly with how long the conversation has run.
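To make the librarian concrete, here is a minimal NumPy sketch of one decoding step. The shapes, random weights, and ReLU feed-forward block are illustrative placeholders (not DeepSeek's architecture); the point is that the attention step touches every cached key and value, so its cost grows with context length, while the feed-forward step sees only one vector and stays fixed.

```python
# One decoding step, illustrative shapes only -- not DeepSeek's implementation.
import numpy as np

d_head, context_len = 128, 4096                   # hypothetical sizes
q = np.random.randn(d_head)                       # query for the NEW token only
K_cache = np.random.randn(context_len, d_head)    # one Key per prior token (the shelf)
V_cache = np.random.randn(context_len, d_head)    # one Value per prior token

# The "re-walk": score the new query against EVERY prior key -- linear in context.
scores = K_cache @ q / np.sqrt(d_head)            # shape (context_len,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                          # softmax over all prior tokens
attended = weights @ V_cache                      # weighted sum of prior values

# The "thinking": the FFN sees only the single attended vector,
# so its cost does not grow with context length.
W1 = np.random.randn(d_head, 4 * d_head)
W2 = np.random.randn(4 * d_head, d_head)
ffn_out = np.maximum(attended @ W1, 0) @ W2
```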

The two costs are joined at the hip. Single-token FLOPs are the floating-point work to produce one more output token, and at long context they are dominated by the linear-in-context attention scan over every prior key vector. The KV cache is the GPU memory reserved to hold those keys and values, and its size follows context_length × layers × heads × head_dim × 2 × bytes_per_value — a formula in which only context length is the user's choice; the rest are architecture decisions baked in at design time. At 1M context the cache can rival or exceed the model weights themselves. Make the shelf shorter — fewer books per token of context — and you simultaneously cut the time to walk it AND the memory to hold it. That is why V4 ships ONE story, not two.

| Factor | What it counts | Multiplier |
|---|---|---|
| K and V | Two vectors stored per token (Key + Value) | × 2 |
| Layers | Each layer has its own cache (like 32 filing cabinets) | × 32 |
| Heads | Each attention head stores its own K/V pair | × 32 |
| Head size | Each K or V vector has 128 numbers (d_head) | × 128 |
| Bytes per number | FP16 = 2 bytes per number (half precision) | × 2 |

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
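Here is a hedged sketch of that formula in Python, plugged with the Llama-2 7B numbers from the table above. The grouped-query-attention and FP8 variant at the end is an assumption added only to show why real serving caches land well below the naive full-attention FP16 figure; it is not a published V3.2 or V4 spec.

```python
# kv_cache_bytes follows the formula in the text:
# context_length x layers x heads x head_dim x 2 (K and V) x bytes_per_value.
def kv_cache_bytes(context_length, layers=32, kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    return context_length * layers * kv_heads * head_dim * 2 * bytes_per_value

print(kv_cache_bytes(1))                          # 524288 bytes ~= 512 KB per token
print(kv_cache_bytes(1_000_000) / 2**30)          # ~488 GiB at 1M context (full MHA, FP16)
print(kv_cache_bytes(1_000_000, kv_heads=8,       # ~61 GiB with 8 KV heads (GQA) + FP8 --
                     bytes_per_value=1) / 2**30)  # inside the ~50-100 GB baseline range below
```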

V4-Pro and V4-Flash deliver both reductions simultaneously. The Hugging Face model card publishes the headline numbers — V4-Pro: 27% FLOPs, 10% KV; V4-Flash: 10% FLOPs, 7% KV — but the architectural specifics are not yet documented in the preview release. Whatever the technique (a modified attention pattern, more aggressive head sharing, a smaller per-token KV footprint, or some combination), the user-visible result is the same: the librarian skims a fraction of the books and keeps a fraction of them on the shelf, and answer quality holds up on frontier-lab benchmarks.

The worked example is what makes this concrete. At 4K context, a single output token takes roughly 1 unit of attention work plus 1 unit of FFN work — call it ~2 units total. At 1M context — 250× longer — the FFN cost is unchanged but attention balloons to ~250 units, so the same token now costs roughly 251 units, a ~125× jump, almost entirely from the attention scan. On the memory side, large MoE-class models at 1M context with grouped-query attention and FP8 cache land in the ~50–100 GB range for V3.2-style baselines. V4-Pro's 10% drops that to ~5–10 GB; V4-Flash's 7% reaches ~3.5–7 GB. A serving cluster that previously held the cache for one concurrent 1M-context user can now hold it for ~10× more on V4-Pro, or ~14× more on V4-Flash — same hardware, same context length, dramatically more users sharing it.
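The same arithmetic as a short script. The unit costs at 4K and 1M, the ~50–100 GB baseline, and the published KV ratios come from the text above; the 75 GB midpoint is an assumption chosen for illustration.

```python
# Worked example from the text, as plain arithmetic.
SHORT_CTX, LONG_CTX = 4_000, 1_000_000

ffn_units = 1                                            # fixed per token, any context
attn_units_short = 1                                     # defined as 1 unit at 4K
attn_units_long = attn_units_short * LONG_CTX / SHORT_CTX  # ~250 units at 1M

cost_short = ffn_units + attn_units_short                # ~2 units
cost_long = ffn_units + attn_units_long                  # ~251 units
print(f"per-token cost: ~{cost_short} units -> ~{cost_long:.0f} units, "
      f"a ~{cost_long / cost_short:.1f}x jump")          # roughly the ~125x in the text

baseline_kv_gb = 75                                      # assumed midpoint of ~50-100 GB
for name, kv_ratio in [("V4-Pro", 0.10), ("V4-Flash", 0.07)]:
    print(f"{name}: ~{baseline_kv_gb * kv_ratio:.1f} GB per 1M-context user, "
          f"~{1 / kv_ratio:.0f}x more users on the same memory")
```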

Go deeper in: LLM Internals → Attention → Computing Attention Scores and LLM Internals → KV Cache → Memory

Frequently Asked Questions