
KV Cache in LLM Inference Explained


What is the KV Cache?

During text generation, the model recomputes attention for the entire sequence at every step — including tokens it has already processed. The KV cache solves this by storing the Key and Value tensors from previous tokens so they only need to be computed once. This turns generation from O(n²) recomputation into O(n) incremental work. The tradeoff is memory: for a 70B parameter model serving a 4K-token sequence, the KV cache alone can use 2–5 GB of GPU memory per request, making it the primary constraint on how many users a server can handle simultaneously.

The Redundancy Problem

How generation uses attention — a concrete example

Imagine the model has generated "The cat" and is about to predict the next word.

To decide what comes next, the model asks a question on behalf of the new token position: "Given everything so far, what should I pay attention to?" This question is the Query (Q).

To answer it, the model needs to check every previous token. Each previous token has two things:

  • A Key (K) — a label describing what that token contains (like a book title). "The" has a key, "cat" has a key.
  • A Value (V) — the actual information that token carries (like the book's content). "The" has a value, "cat" has a value.

The model compares the Query against all Keys to find which tokens matter, then reads those tokens' Values to produce the output.

[Diagram: Tokens, Keys, Values] Each previous token ("The", "cat", "sat") has a stored Key (K[The], K[cat], K[sat]) and a stored Value (V[The], V[cat], V[sat]); the unknown next position supplies the Query (Q).
New token's Query compares against all previous Keys → reads matching Values
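
To make the roles concrete, here is a minimal NumPy sketch of one attention step. The weight matrices and token embeddings below are random placeholders, not real model values; only the shapes and the flow (Query compares against Keys, output blends Values) matter:

    import numpy as np

    # Toy sizes: real models use d_model in the thousands.
    d_model = 4
    np.random.seed(0)

    # Placeholder projection weights (learned in a real model, random here).
    W_Q = np.random.randn(d_model, d_model)
    W_K = np.random.randn(d_model, d_model)
    W_V = np.random.randn(d_model, d_model)

    context = np.random.randn(2, d_model)   # embeddings for "The", "cat" (made-up numbers)
    new_pos = np.random.randn(d_model)      # the position whose next word is being predicted

    K = context @ W_K                       # one Key per previous token
    V = context @ W_V                       # one Value per previous token
    q = new_pos @ W_Q                       # one Query for the new position

    scores = K @ q / np.sqrt(d_model)       # compare the Query against every Key
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: which tokens matter?
    output = weights @ V                    # read the matching Values
    print(output)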

The problem: to predict "sat" (the 3rd token), the model already computed K and V for "The" and "cat." Now to predict "on" (the 4th token), it needs K and V for "The", "cat", and "sat." The only new K/V needed is for "sat." But the model doesn't remember the K/V it already computed for "The" and "cat" — it runs the same matrix multiplications again to get the exact same numbers. That's wasted work.

How the waste grows

Without any optimization, every step recomputes everything. Each step in the diagram below performs one K/V computation for every token seen so far:

[Diagram: the triangle of wasted work] Generating "The cat sat on the mat" without a cache: step 1 = 1 op, step 2 = 2 ops, step 3 = 3 ops, step 4 = 4 ops, step 5 = 5 ops, step 6 = 6 ops. Total: 21 K/V computations, following n(n+1)/2 = 6 × 7 / 2.

Each step recomputes K/V for all tokens up to that point — not just the new one. The per-step counts stack up into a triangle, which is why the total follows the formula n × (n + 1) / 2. For 1000 tokens:

1000 × 1001 / 2 = 500,500 total K/V computations

But with a cache, only 1000 computations are needed (one per token). That's a 500× difference — and it gets worse the longer the sequence.
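
A quick way to sanity-check the triangle arithmetic is to count the operations directly (a throwaway Python sketch):

    def kv_computations(n_tokens):
        """K/V computations needed to generate n_tokens, with and without a cache."""
        without_cache = sum(range(1, n_tokens + 1))   # 1 + 2 + ... + n  (the triangle)
        with_cache = n_tokens                         # one K/V pair per token
        return without_cache, with_cache

    print(kv_computations(6))      # (21, 6)
    print(kv_computations(1000))   # (500500, 1000) -> roughly a 500x gap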

Why Recomputing Is Pure Waste

K/V vectors for a token depend only on the token itself and the weights of that layer. "The" at position 3 produces the exact same K/V vectors whether it's computed at step 3 or step 3,000. The matrix multiplications are identical. The output is identical. The computation is redundant.

  • Step 2: recomputes token 1's K/V pair — already done at step 1
  • Step 3: recomputes tokens 1 and 2 — already done at steps 1 and 2
  • Step n: recomputes all n−1 prior tokens' K/V pairs, again

This is the problem the KV cache solves.

K/V vectors for "The" are identical whether computed at step 1 or step 1000. Same matrix multiplications, same weights, same input — same output. Computing them again is pure waste.


The KV Cache Solution


The fix is straightforward: compute each token's K/V vectors once, store them, and reuse them on every subsequent step. This is the KV cache.

How It Works

1. New token arrives: the current generation step.
2. Compute K/V: one new pair only.
3. Store in cache: append to the per-layer cache.
4. Attention reads ALL from cache: the current plus all previous K/V.
5. Output token: append to the sequence and repeat.

With the cache, every step computes exactly one new K/V pair regardless of context length. Total cost becomes linear — O(n) instead of O(n²).
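
Here is a minimal sketch of that loop for a single layer and a single head, with random placeholder weights and inputs. The point is the bookkeeping: each step appends exactly one K/V pair and reads everything stored so far:

    import numpy as np

    d_model = 8
    np.random.seed(0)
    W_Q = np.random.randn(d_model, d_model)
    W_K = np.random.randn(d_model, d_model)
    W_V = np.random.randn(d_model, d_model)

    k_cache, v_cache = [], []                  # the KV cache: one entry appended per step

    def decode_step(x):
        """One generation step: compute K/V for the new token only, reuse the rest."""
        k_cache.append(x @ W_K)                # compute once, store
        v_cache.append(x @ W_V)
        q = x @ W_Q                            # Q is always computed fresh, never cached
        K, V = np.stack(k_cache), np.stack(v_cache)   # read ALL cached K/V
        scores = K @ q / np.sqrt(d_model)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                           # attention output for this step

    for _ in range(6):                         # six decode steps
        decode_step(np.random.randn(d_model))  # stand-in for the new token's embedding
    print(len(k_cache))                        # 6 cached pairs, not 21 recomputations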

Why Q Is Not Cached

Attention uses three vectors per token: Query, Key, and Value.

  • K and V come from the context — tokens that have already been processed. Once computed, they never change. Cache them.
  • Q comes from the current prediction token — the one the model is generating right now. It changes every step. There is nothing to cache.

The cache stores K/V. Q is always freshly computed.

The cache lives at every layer

Remember from the transformer block module: a model like GPT-3 has 96 layers stacked on top of each other. Each layer has its own attention, which means each layer computes its own K and V vectors for every token.

So the KV cache isn't one big storage box — it's one cache per layer. When the model generates a new token, all 96 layers compute and store their K/V for that token. When the next token is generated, all 96 layers read their cached K/V for all previous tokens.

Think of it like a 96-floor building. Each floor (layer) has its own filing cabinet (cache). When a new document (token) arrives, every floor files a copy. When the next document needs context, every floor pulls out all its previous files.

  • Llama-2 7B (32 layers): 32 separate K/V caches growing together
  • GPT-3 (96 layers): 96 separate K/V caches growing together

This is why KV cache memory scales with both sequence length and number of layers.
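
A sketch of that bookkeeping, assuming Llama-2 7B-style shapes (32 layers, 32 heads, head size 128). The zero tensors are placeholders; in a real model each layer computes its own K/V values:

    import numpy as np

    n_layers, n_heads, d_head = 32, 32, 128    # Llama-2 7B-style shapes

    # One cache per layer: K and V of shape (tokens_so_far, n_heads, d_head).
    kv_cache = [
        {"k": np.empty((0, n_heads, d_head), dtype=np.float16),
         "v": np.empty((0, n_heads, d_head), dtype=np.float16)}
        for _ in range(n_layers)
    ]

    def file_new_token(k_new, v_new):
        """Every layer files its own copy of the new token's K/V."""
        for layer in kv_cache:
            layer["k"] = np.concatenate([layer["k"], k_new[None]], axis=0)
            layer["v"] = np.concatenate([layer["v"], v_new[None]], axis=0)

    # Generating one token appends one K/V row to all 32 per-layer caches.
    file_new_token(np.zeros((n_heads, d_head), np.float16),
                   np.zeros((n_heads, d_head), np.float16))
    print(len(kv_cache), kv_cache[0]["k"].shape)   # 32 caches, each now (1, 32, 128)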

Where does the KV cache live? Not in the model itself. The trained model weights don't include any cache — they're just the fixed W_Q, W_K, W_V matrices. The KV cache is created and managed by the serving framework at inference time: vLLM, TensorRT-LLM, HuggingFace generate(), or llama.cpp. Karpathy's nanoGPT skips caching for simplicity, but every production system implements it.

Side by side, the difference is stark: the "without cache" cost grows as a triangle, while the "with cache" cost stays flat at one computation per step.

Prefill vs Decode


When you send a message to ChatGPT, two very different things happen behind the scenes:

Prefill ("The cat sat on"):
  • All prompt tokens processed at once (parallel)
  • KV cache fills up in one shot
  • GPU does lots of math (compute-bound)
  • Fast: GPU is good at parallel work

Decode ("the" → "mat" → "."):
  • Output tokens generated one at a time
  • Each step reads the entire KV cache
  • GPU mostly loads data (memory-bound)
  • Slower: waiting for data, not computing

Prefill = one big batch (fast) → Decode = one token at a time (slower)

Prefill is like a teacher reading an entire essay at once — the GPU processes all your prompt tokens in parallel, computes K/V for every token, and fills the cache in one shot. This is fast because GPUs are built for parallel work.

Decode is like writing a reply one word at a time — the model generates each output token by looking back at the entire KV cache (everything seen so far), producing one token, then repeating. This is slower because the GPU spends most of its time loading the cache from memory, not computing.

TTFT vs TPS

Time to first token (TTFT) measures how long prefill takes. A short prompt prefills quickly; a 10,000-token document takes roughly 100× longer than a 100-token prompt, because the model must process every input token before it can output anything.

Tokens per second (TPS) measures decode throughput — how fast the model generates after the first token. Longer context doesn't slow decode dramatically, but larger cache means more memory bandwidth consumed per step.

The two metrics are largely independent. A model can have excellent TPS but poor TTFT for long prompts, or vice versa.

When ChatGPT pauses briefly before the first word appears after you paste a long document — that pause is prefill. The model is processing your entire input and filling the KV cache before it generates even the first token. Longer paste, longer pause.
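
A back-of-the-envelope latency model makes the split concrete. The 2,000 tokens/s prefill and 40 tokens/s decode figures below are illustrative placeholders, not measurements of any particular system:

    def latency_estimate(prompt_tokens, output_tokens,
                         prefill_tps=2000.0, decode_tps=40.0):
        """Crude latency model: prefill is one parallel pass, decode is sequential."""
        ttft = prompt_tokens / prefill_tps          # time to first token (prefill)
        total = ttft + output_tokens / decode_tps   # plus one decode step per output token
        return ttft, total

    print(latency_estimate(100, 200))      # short prompt: TTFT ~0.05 s
    print(latency_estimate(10_000, 200))   # pasted document: TTFT ~5 s, same decode speed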


Memory Cost


The KV cache trades computation for memory. But how much memory exactly? Let's build up the formula piece by piece for Llama-2 7B:

K and V
Two vectors stored per token (Key + Value)
× 2
Layers
Each layer has its own cache (like 32 filing cabinets)
× 32
Heads
Each attention head stores its own K/V pair
× 32
Head size
Each K or V vector has 128 numbers (d_head)
× 128
Bytes per number
FP16 = 2 bytes per number (half precision)
× 2
Per token (Llama-2 7B):2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB

Each term corresponds to a real part of the model:

  • × 2 (K and V) — we store two vectors per token: one Key and one Value
  • × 32 (layers) — each transformer layer has its own cache (remember the 32-floor building from step 2)
  • × 32 (heads) — recall from the Attention module: each layer runs multiple attention heads in parallel, each looking for different patterns (syntax, semantics, etc.). Each head stores its own separate K/V pair.
  • × 128 (head size) — each K or V vector has 128 numbers. The model's full dimension (4096) is split evenly across heads: 4096 ÷ 32 heads = 128 per head
  • × 2 (bytes per number) — each number is stored in FP16 (half precision), which takes 2 bytes of memory

Multiply them all: 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB per token.

That's half a megabyte for every single token in the conversation.
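
The formula translates directly into a small helper (Llama-2 7B numbers shown; the function and its parameter names are just a sketch):

    def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_number=2):
        """2 (K and V) x layers x heads x d_head x bytes, per token and per request."""
        per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_number
        return per_token, per_token * n_tokens

    per_token, per_request = kv_cache_bytes(n_layers=32, n_kv_heads=32,
                                            d_head=128, n_tokens=4096)
    print(per_token)              # 524288 bytes (~512 KB per token)
    print(per_request / 2**30)    # 2.0 GiB (~2.1 GB) per 4,096-token request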

What this means in practice

  • Llama-2 7B at 4,096 tokens: 512 KB × 4,096 = ~2.1 GB per request
  • GPT-3 175B (96 layers, 96 heads): ~4.7 MB/token → 4,096 tokens = ~19 GB per request — only ~5% of its ~350 GB FP16 weight footprint at this context, but the ratio climbs fast as context and concurrency grow (see below)
  • 10 concurrent users on Llama-2 7B: 2.1 GB × 10 = 21 GB just for KV cache, more than the 14 GB FP16 weights of the model itself

Context Window = Memory

Doubling the context window doubles the KV cache. Quadrupling it quadruples it. This growth is linear — unlike attention's intermediate score matrix, which grew quadratically in memory with context length before Flash Attention made materializing it unnecessary (attention compute is still quadratic).

  • 4K context → 2.1 GB (Llama-2 7B)
  • 32K context → 16.8 GB — already larger than the 14 GB FP16 weights
  • 128K context → 67 GB on one user's request — nearly 5× the 14 GB FP16 weight footprint of Llama-2 7B itself. Long context flips the cost ratio: cache, not weights, dominates HBM

This is why 128K context models need massive GPU clusters: the memory isn't for compute, it's for the cache.

KV cache is the hidden cost of long context. The formula is simple — 2 × layers × heads × d_head × bytes — but the numbers compound fast. 128K tokens × bytes_per_token × concurrent users = total GPU memory needed. This is why GQA, cache quantization, and paged attention exist.


GQA: Shrinking the Cache


From the previous step, we know the KV cache stores a K/V pair per head. More heads = more cache memory. Can we reduce the number of K/V pairs without losing quality?

The key idea: share K/V across query heads

In standard attention, every Query head (Q) has its own dedicated Key and Value heads (K/V). But what if multiple Q heads shared the same K/V?

[Diagram: 8 Query heads, three ways of attaching K/V heads]

  • MHA (standard): each Q head has its own K/V → 8 KV pairs, full size
  • MQA: all Q heads share one K/V → 1 KV pair, 8× smaller
  • GQA (modern): groups of 4 Q heads share a K/V → 2 KV pairs, 4× smaller

The three layouts differ only in which Query heads read from which K/V heads:

MHA (standard): Each Q head gets its own K/V — one-to-one. 8 Q heads = 8 K/V pairs. Full expressiveness, but the cache is large.

MQA: All Q heads share one K/V pair. The cache shrinks 8× — but all heads see the exact same Key/Value information. It's like 8 students all sharing one textbook — they all read the same page, so they all think the same way. The model loses the ability to look at the input from different angles, and quality drops.

GQA (modern): The fix for MQA's problem. Instead of forcing all heads to share one K/V, split them into small groups. Each group of 4 Q heads shares one K/V pair. Now there are 2 different K/V pairs — so the model still has 2 different perspectives on the input, not just 1. It's like splitting 8 students into 2 study groups of 4, each group with its own textbook. Each group still shares, but different groups read different material — preserving diversity while using far fewer books than giving one to each student.
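
A minimal NumPy sketch of the GQA read path, with toy shapes (8 Q heads, 2 KV heads) and random placeholder tensors. The only difference from standard attention is that the cached K/V has fewer heads and each group of Q heads reads the same one:

    import numpy as np

    np.random.seed(0)
    n_q_heads, n_kv_heads, d_head, seq = 8, 2, 16, 5   # toy shapes: 8 Q heads, 2 KV heads
    group = n_q_heads // n_kv_heads                    # 4 query heads share each K/V head

    q = np.random.randn(n_q_heads, d_head)             # one Query per Q head (current token)
    k_cache = np.random.randn(n_kv_heads, seq, d_head) # only 2 K heads are cached, not 8
    v_cache = np.random.randn(n_kv_heads, seq, d_head)

    # Q heads 0-3 read KV head 0, Q heads 4-7 read KV head 1.
    K = np.repeat(k_cache, group, axis=0)              # broadcast to (8, seq, d_head)
    V = np.repeat(v_cache, group, axis=0)

    scores = np.einsum("hd,hsd->hs", q, K) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    out = np.einsum("hs,hsd->hd", w, V)                # same output shape as full MHA,
    print(out.shape)                                   # but the cache holds 2 KV heads, not 8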

In Practice

  • Llama-2 70B uses GQA with 8 query heads per K/V head
  • Mistral 7B uses GQA from the start
  • GPT-3 uses standard MHA — predates GQA adoption

An 8× cache reduction means 8× more concurrent users on the same hardware, or 8× longer context at the same memory budget. This is why GQA has been adopted almost universally in large models released after 2023.

GQA is why modern large models can serve long contexts at all. Without it, the KV cache for a 70B model at 128K context would exceed all practical GPU memory limits. Group size is a hardware-aware hyperparameter — tuned to balance cache size against attention quality.


KV Cache in Production


The KV cache solves the computation problem but creates a memory management problem. Production serving systems spend enormous engineering effort on this cache.

Batch Size vs Context Length

GPU memory is finite. More cache per request = fewer concurrent requests. Both grids below use the same total GPU memory:

[Diagram: two ways to spend the same GPU memory]
  • 32 users × 4K tokens: many short conversations
  • 4 users × 32K tokens: few long conversations
Same total GPU memory: the area of both grids is equal.

Serving providers must choose: many short conversations or few long conversations on the same hardware. This is the central serving tradeoff — throughput is maximized by batching, but batch size is constrained by cache memory per request.
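
The tradeoff is simple division. A rough capacity sketch, assuming an illustrative 64 GiB cache budget and the 512 KB/token figure from the Memory Cost section:

    def max_concurrent_requests(cache_budget_gib, per_token_kib, context_tokens):
        """How many requests fit in a fixed KV-cache memory budget?"""
        per_request_gib = per_token_kib * context_tokens / 2**20
        return int(cache_budget_gib // per_request_gib)

    # 64 GiB set aside for cache (illustrative), 512 KiB/token (Llama-2 7B-style):
    print(max_concurrent_requests(64, 512, 4_096))    # 32 users at 4K context
    print(max_concurrent_requests(64, 512, 32_768))   # 4 users at 32K context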

KV Cache Quantization

Model weights are often quantized (FP16 → INT8). The KV cache can be quantized independently — shrinking the bytes per number stored in the cache:

Memory per number in the KV cache:

  • FP16: 2 bytes (baseline)
  • FP8: 1 byte (minimal loss)
  • INT8: 1 byte (slight loss)
  • INT4: 0.5 bytes (noticeable loss)

KV quantization is one of the highest-leverage memory reductions because for long-context requests, the cache can be larger than the model weights themselves.
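
Plugging different byte widths into the same per-token formula shows the lever (Llama-2 7B shapes at a 32K context; a rough sketch that ignores quantization overheads such as stored scales):

    def kv_cache_gib(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_number):
        return 2 * n_layers * n_kv_heads * d_head * bytes_per_number * n_tokens / 2**30

    # Llama-2 7B shapes at a 32K-token context, varying only the cache dtype:
    for name, nbytes in [("FP16", 2), ("FP8", 1), ("INT8", 1), ("INT4", 0.5)]:
        print(name, kv_cache_gib(32, 32, 128, 32_768, nbytes), "GiB")
    # FP16 16.0, FP8 8.0, INT8 8.0, INT4 4.0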

Eviction Policies

When cache memory fills up, something must be dropped. The simplest policy is to remove the earliest tokens first:

[Diagram: "The cat sat on a very soft mat" at positions 0–7; the earliest positions are evicted, the most recent are kept in the cache]

Each strategy has tradeoffs — "drop oldest" is simple but loses early context, "drop least-attended" is smarter but requires tracking attention, and "sliding window" (used in Mistral) is the most efficient but strictly limits how far back the model can look.
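
A toy sketch of the sliding-window variant, using a plain Python deque in place of real K/V tensors:

    from collections import deque

    window = 4                                   # keep only the last 4 tokens' K/V
    cache = deque(maxlen=window)                 # a full deque drops its oldest entry automatically

    for token in ["The", "cat", "sat", "on", "a", "very", "soft", "mat"]:
        cache.append(token)                      # in a real cache this is the token's K/V tensors

    print(list(cache))                           # ['a', 'very', 'soft', 'mat'] - earliest evicted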

Memory-Bandwidth vs Compute

During decode, the GPU does two things: load the KV cache from memory and compute the attention math. Which one is the bottleneck depends on batch size:

[Toy roofline meter] At batch size 1 and sequence length 512, arithmetic intensity is about 4.5 FLOPs/byte, far below the ridge of ≈ 150 FLOPs/byte: decode is memory-bandwidth-bound (HBM bandwidth saturated, compute around 3% utilized). The meter is calibrated to an A100-class GPU (HBM ~2 TB/s, ridge ≈ 150 FLOPs/byte); other GPUs have different ridge points.

Decode rule of thumb: arithmetic intensity scales with batch size, because model weights are loaded once per step and amortized across the batch, while sequence length barely moves it. Batching is the dominant lever for decode throughput.

With few requests, the GPU finishes computing quickly but spends most time waiting for cache data to arrive from memory. With many requests, the loading cost is shared — the GPU stays busy computing instead of waiting. This is why batching improves throughput.
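
A crude model of why, under assumed 7B-class numbers (14 GB of FP16 weights, roughly 2 FLOPs per parameter per decoded token) and ignoring KV-cache traffic. The absolute values are illustrative; the point is that intensity grows linearly with batch size:

    def decode_arithmetic_intensity(batch, weight_bytes, flops_per_token):
        """Toy model: weights are read once per step and amortized across the batch."""
        return batch * flops_per_token / weight_bytes   # FLOPs per byte moved

    # Illustrative 7B-class numbers: 14e9 weight bytes (FP16), ~14e9 FLOPs per decoded token.
    for batch in (1, 8, 64):
        ai = decode_arithmetic_intensity(batch, 14e9, 14e9)
        print(batch, ai, "memory-bound" if ai < 150 else "compute-bound")   # ridge ~150 (A100-ish)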

Flash Attention

To understand Flash Attention, you need to know that a GPU has two types of memory:

  • HBM (High Bandwidth Memory) — the GPU's main memory (e.g., 80 GB on an A100). Large but relatively slow to access. This is where the KV cache lives.
  • SRAM — a tiny, ultra-fast scratchpad inside the GPU's compute cores (e.g., 20 MB on an A100). Data here can be read instantly, but there's very little of it.

The problem: standard attention loads the entire KV cache from HBM into SRAM, computes the scores, writes them back to HBM, loads them again for softmax, writes again... lots of back-and-forth between slow HBM and fast SRAM.

Flash Attention fixes this by computing attention in small tiles:

[Diagram: Standard vs Flash Attention memory traffic]
  • Standard attention: load all K/V from HBM (slow) into SRAM (fast), write intermediate results back, load them again... multiple round trips for all the data.
  • Flash Attention: K/V split into tiles (K₁V₁, K₂V₂, K₃V₃, K₄V₄); one tile at a time moves into SRAM, gets computed, then the next takes its place.
Same result, fewer memory round trips.

Instead of loading the entire cache at once, Flash Attention processes one tile at a time: only one piece moves from HBM to SRAM, gets computed, then the next piece takes its place. Each piece makes one round trip instead of several.

The result: 2–5× faster attention, much less memory for intermediate results. The KV cache still exists — Flash Attention just reads it more efficiently, like reading a book one chapter at a time instead of photocopying the entire book first.
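
A single-query NumPy sketch of the tiling idea (the online-softmax trick). Real Flash Attention is a fused GPU kernel operating on full batches and heads; this only shows that processing K/V one tile at a time, keeping a running max and normalizer, reproduces exactly the same result:

    import numpy as np

    np.random.seed(0)
    seq, d, tile = 16, 8, 4
    q = np.random.randn(d)                     # one decode-step Query
    K = np.random.randn(seq, d)                # cached Keys   (live in "HBM")
    V = np.random.randn(seq, d)                # cached Values

    # Naive attention: touch every Key/Value at once.
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    reference = w @ V

    # Tiled attention with an online softmax: one tile in "SRAM" at a time,
    # keeping only a running max, a running normalizer, and a running output.
    running_max, running_sum, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, seq, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]   # move one tile
        st = Kt @ q / np.sqrt(d)
        new_max = max(running_max, st.max())
        rescale = np.exp(running_max - new_max)                 # adjust what we had so far
        p = np.exp(st - new_max)
        running_sum = running_sum * rescale + p.sum()
        acc = acc * rescale + p @ Vt
        running_max = new_max

    tiled = acc / running_sum
    print(np.allclose(reference, tiled))       # True: same result, one pass over the cache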

KV cache is the central bottleneck of LLM serving. GQA, cache quantization, paged attention, sliding window attention — every major inference optimization is ultimately about managing this cache more efficiently. Understanding it is the foundation for understanding all of LLM infrastructure.
