DeepSeek V4-Pro and V4-Flash — long-context cost cut to a fraction

LLM
DeepSeek V4 paired-cost-cut animation — three rows comparing V3.2 baseline to V4-Pro and V4-Flash. FLOPs/token and KV cache shrink together, enabling roughly 10× more concurrent 1M-context users on V4-Pro and 14× more on V4-Flash.

The news. On April 24, 2026, DeepSeek released previews of V4-Pro and V4-Flash — long-context models built around dramatically cheaper per-token inference. At 1M tokens of context, V4-Pro reaches 27% of V3.2's per-token FLOPs and 10% of its KV cache, while V4-Flash pushes further to 10% FLOPs and 7% KV. Both numbers move together because they are two sides of one bill: the work to produce the next token, and the memory needed to remember everything before it. Source: Hugging Face model card →

Picture the librarian. Every time someone asks a new question, two things happen. The librarian thinks for a fixed amount of time — that's the feed-forward network, the per-layer computation that doesn't care how long the conversation has been. Then the librarian re-walks the entire shelf, one book at a time, weighing how much each prior page bears on the new question. That re-walk is attention, and at 1M context the shelf has roughly a million books on it. The walking dwarfs the thinking by ~125×. Meanwhile every prior question and answer has already left a book on the shelf — the KV cache — and that shelf occupies real memory, growing strictly linearly with how long the conversation has run.
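To make the librarian concrete, here is a minimal NumPy sketch of one decoding step. The shapes, random weights, and ReLU feed-forward block are illustrative placeholders (not DeepSeek's architecture); the point is that the attention step touches every cached key and value, so its cost grows with context length, while the feed-forward step sees only one vector and stays fixed.

```python
# One decoding step, illustrative shapes only -- not DeepSeek's implementation.
import numpy as np

d_head, context_len = 128, 4096                   # hypothetical sizes
q = np.random.randn(d_head)                       # query for the NEW token only
K_cache = np.random.randn(context_len, d_head)    # one Key per prior token (the shelf)
V_cache = np.random.randn(context_len, d_head)    # one Value per prior token

# The "re-walk": score the new query against EVERY prior key -- linear in context.
scores = K_cache @ q / np.sqrt(d_head)            # shape (context_len,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                          # softmax over all prior tokens
attended = weights @ V_cache                      # weighted sum of prior values

# The "thinking": the FFN sees only the single attended vector,
# so its cost does not grow with context length.
W1 = np.random.randn(d_head, 4 * d_head)
W2 = np.random.randn(4 * d_head, d_head)
ffn_out = np.maximum(attended @ W1, 0) @ W2
```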

The two costs are joined at the hip. Single-token FLOPs are the floating-point work to produce one more output token, and at long context they are dominated by the linear-in-context attention scan over every prior key vector. The KV cache is the GPU memory reserved to hold those keys and values, and its size follows context_length × layers × heads × head_dim × 2 × bytes_per_value — a formula in which only context length is the user's choice; the rest are architecture decisions baked in at design time. At 1M context the cache can rival or exceed the model weights themselves. Make the shelf shorter — fewer books per token of context — and you simultaneously cut the time to walk it AND the memory to hold it. That is why V4 ships ONE story, not two.

| Factor | What it counts | Multiplier |
|---|---|---|
| K and V | Two vectors stored per token (Key + Value) | × 2 |
| Layers | Each layer has its own cache (like 32 filing cabinets) | × 32 |
| Heads | Each attention head stores its own K/V pair | × 32 |
| Head size | Each K or V vector has 128 numbers (d_head) | × 128 |
| Bytes per number | FP16 = 2 bytes per number (half precision) | × 2 |

Per token (Llama-2 7B): 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB
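Here is a hedged sketch of that formula in Python, plugged with the Llama-2 7B numbers from the table above. The grouped-query-attention and FP8 variant at the end is an assumption added only to show why real serving caches land well below the naive full-attention FP16 figure; it is not a published V3.2 or V4 spec.

```python
# kv_cache_bytes follows the formula in the text:
# context_length x layers x heads x head_dim x 2 (K and V) x bytes_per_value.
def kv_cache_bytes(context_length, layers=32, kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    return context_length * layers * kv_heads * head_dim * 2 * bytes_per_value

print(kv_cache_bytes(1))                          # 524288 bytes ~= 512 KB per token
print(kv_cache_bytes(1_000_000) / 2**30)          # ~488 GiB at 1M context (full MHA, FP16)
print(kv_cache_bytes(1_000_000, kv_heads=8,       # ~61 GiB with 8 KV heads (GQA) + FP8 --
                     bytes_per_value=1) / 2**30)  # inside the ~50-100 GB baseline range below
```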

V4-Pro and V4-Flash deliver both reductions simultaneously. The Hugging Face model card publishes the headline numbers — V4-Pro: 27% FLOPs, 10% KV; V4-Flash: 10% FLOPs, 7% KV — but the architectural specifics are not yet documented in the preview release. Whatever the technique (a modified attention pattern, more aggressive head sharing, a smaller per-token KV footprint, or some combination), the user-visible result is the same: the librarian skims a fraction of the books and keeps a fraction of them on the shelf, and answer quality holds up on frontier-lab benchmarks.

The worked example is what makes this concrete. At 4K context, a single output token takes roughly 1 unit of attention work plus 1 unit of FFN work — call it ~2 units total. At 1M context — 250× longer — the FFN cost is unchanged but attention balloons to ~250 units, so the same token now costs roughly 251 units, a ~125× jump, almost entirely from the attention scan. On the memory side, large MoE-class models at 1M context with grouped-query attention and FP8 cache land in the ~50–100 GB range for V3.2-style baselines. V4-Pro's 10% drops that to ~5–10 GB; V4-Flash's 7% reaches ~3.5–7 GB. A serving cluster that previously held the cache for one concurrent 1M-context user can now hold it for ~10× more on V4-Pro, or ~14× more on V4-Flash — same hardware, same context length, dramatically more users sharing it.
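The same arithmetic as a short script. The unit costs at 4K and 1M, the ~50–100 GB baseline, and the published KV ratios come from the text above; the 75 GB midpoint is an assumption chosen for illustration.

```python
# Worked example from the text, as plain arithmetic.
SHORT_CTX, LONG_CTX = 4_000, 1_000_000

ffn_units = 1                                            # fixed per token, any context
attn_units_short = 1                                     # defined as 1 unit at 4K
attn_units_long = attn_units_short * LONG_CTX / SHORT_CTX  # ~250 units at 1M

cost_short = ffn_units + attn_units_short                # ~2 units
cost_long = ffn_units + attn_units_long                  # ~251 units
print(f"per-token cost: ~{cost_short} units -> ~{cost_long:.0f} units, "
      f"a ~{cost_long / cost_short:.1f}x jump")          # roughly the ~125x in the text

baseline_kv_gb = 75                                      # assumed midpoint of ~50-100 GB
for name, kv_ratio in [("V4-Pro", 0.10), ("V4-Flash", 0.07)]:
    print(f"{name}: ~{baseline_kv_gb * kv_ratio:.1f} GB per 1M-context user, "
          f"~{1 / kv_ratio:.0f}x more users on the same memory")
```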

Go deeper in: LLM Internals → Attention → Computing Attention Scores and LLM Internals → KV Cache → Memory

Frequently Asked Questions