KV Cache in LLM Inference Explained
What is the KV Cache?
During text generation, the model recomputes attention for the entire sequence at every step — including tokens it has already processed. The KV cache solves this by storing the Key and Value tensors from previous tokens so they only need to be computed once. This turns generation from O(n²) recomputation into O(n) incremental work. The tradeoff is memory: for a 70B parameter model serving a 4K-token sequence, the KV cache alone can use 2–5 GB of GPU memory per request, making it the primary constraint on how many users a server can handle simultaneously.
The Redundancy Problem
How generation uses attention — a concrete example
Imagine the model has generated "The cat" and is about to predict the next word.
To decide what comes next, the model asks a question on behalf of the new token position: "Given everything so far, what should I pay attention to?" This question is the Query (Q).
To answer it, the model needs to check every previous token. Each previous token has two things:
- A Key (K) — a label describing what that token contains (like a book title). "The" has a key, "cat" has a key.
- A Value (V) — the actual information that token carries (like the book's content). "The" has a value, "cat" has a value.
The model compares the Query against all Keys to find which tokens matter, then reads those tokens' Values to produce the output.
The problem: to predict "sat" (the 3rd token), the model already computed K and V for "The" and "cat." Now to predict "on" (the 4th token), it needs K and V for "The", "cat", and "sat." The only new K/V needed is for "sat." But the model doesn't remember the K/V it already computed for "The" and "cat" — it runs the same matrix multiplications again to get the exact same numbers. That's wasted work.
How the waste grows
Without any optimization, every step recomputes everything. Each row below is one generation step — the colored squares are K/V computations performed at that step:
Each step recomputes K/V for all tokens up to that point — not just the new one. The colored area forms a triangle, which is why the total follows the formula n × (n + 1) / 2. For 1000 tokens:
1000 × 1001 / 2 = 500,500 total K/V computations
But with a cache, only 1000 computations are needed (one per token). That's a 500× difference — and it gets worse the longer the sequence.
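The arithmetic is easy to check with a few lines of Python:

```python
def kv_computations(n_tokens: int, cached: bool) -> int:
    """Count K/V computations over n_tokens generation steps."""
    if cached:
        # One new K/V pair per step; earlier pairs are reused from the cache.
        return n_tokens
    # Without a cache, step t recomputes K/V for all t tokens seen so far.
    return sum(t for t in range(1, n_tokens + 1))

print(kv_computations(1000, cached=False))  # 500500 == 1000 * 1001 // 2
print(kv_computations(1000, cached=True))   # 1000
```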
Why Recomputing Is Pure Waste
K/V vectors for a token depend only on that layer's input at the token's position and the layer's fixed weights, and with causal attention that input never changes once the token has been generated. "The" at position 1 produces the exact same K/V vectors whether they're computed at step 3 or step 3,000. The matrix multiplications are identical. The output is identical. The computation is redundant.
- Step 2: recomputes token 1's K/V pair — already done at step 1
- Step 3: recomputes tokens 1 and 2 — already done at steps 1 and 2
- Step n: recomputes all n−1 prior tokens' K/V pairs, again
This is the problem the KV cache solves.
K/V vectors for "The" are identical whether computed at step 1 or step 1000. Same matrix multiplications, same weights, same input — same output. Computing them again is pure waste.
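This determinism can be verified with a toy projection (random weights and toy dimensions for illustration; in a real model the input would be the layer's hidden state at that position, which is frozen under causal attention):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_K = rng.standard_normal((d_model, d_head))  # fixed trained weights
x_the = rng.standard_normal(d_model)          # input for "The" at its fixed position

k_at_step_3 = x_the @ W_K     # computed while predicting token 3
k_at_step_3000 = x_the @ W_K  # recomputed much later: same input, same weights

# Bit-identical output — recomputing it buys nothing.
assert np.array_equal(k_at_step_3, k_at_step_3000)
```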
Explore the right panel to see the computation cost grow with each token generated.
The KV Cache Solution
The fix is straightforward: compute each token's K/V vectors once, store them, and reuse them on every subsequent step. This is the KV cache.
How It Works
With the cache, every step computes exactly one new K/V pair regardless of context length. Total cost becomes linear — O(n) instead of O(n²).
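A minimal single-head NumPy sketch of a cached decode step (toy dimensions; a real model adds layers, multiple heads, positional encoding, and batching):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
K_cache, V_cache = [], []  # grows by one entry per generated token

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Process one new token: compute its K/V once, reuse everything cached."""
    K_cache.append(x_new @ W_K)  # the ONLY new K this step
    V_cache.append(x_new @ W_V)  # the ONLY new V this step
    q = x_new @ W_Q              # Q is always computed fresh, never cached
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)            # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # attention output for the new token

for _ in range(5):                # 5 decode steps -> 5 cached K/V pairs, total
    out = decode_step(rng.standard_normal(d))
print(len(K_cache))  # 5
```

Each call does one K/V projection regardless of how long the context already is; the without-cache version would redo all of them every step.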
Why Q Is Not Cached
Attention uses three vectors per token: Query, Key, and Value.
- K and V come from the context — tokens that have already been processed. Once computed, they never change. Cache them.
- Q comes from the current prediction token — the one the model is generating right now. It changes every step. There is nothing to cache.
The cache stores K/V. Q is always freshly computed.
The cache lives at every layer
Remember from the transformer block module: a model like GPT-3 has 96 layers stacked on top of each other. Each layer has its own attention, which means each layer computes its own K and V vectors for every token.
So the KV cache isn't one big storage box — it's one cache per layer. When the model generates a new token, all 96 layers compute and store their K/V for that token. When the next token is generated, all 96 layers read their cached K/V for all previous tokens.
Think of it like a 96-floor building. Each floor (layer) has its own filing cabinet (cache). When a new document (token) arrives, every floor files a copy. When the next document needs context, every floor pulls out all its previous files.
- Llama-2 7B (32 layers): 32 separate K/V caches growing together
- GPT-3 (96 layers): 96 separate K/V caches growing together
This is why KV cache memory scales with both sequence length and number of layers.
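The per-layer structure can be sketched as one (K, V) buffer pair per layer; the shapes loosely follow the (heads, seq_len, head_dim) layout common in serving frameworks, with the batch dimension omitted here:

```python
import numpy as np

n_layers, n_heads, d_head = 32, 32, 128   # Llama-2 7B shapes

# One (K, V) pair of buffers per layer; all 32 grow in lockstep.
cache = [
    {"K": np.empty((n_heads, 0, d_head), dtype=np.float16),
     "V": np.empty((n_heads, 0, d_head), dtype=np.float16)}
    for _ in range(n_layers)
]

def append_token(cache, new_kv_per_layer):
    """After each new token, EVERY layer files its own K/V for that token."""
    for layer, (k, v) in zip(cache, new_kv_per_layer):
        layer["K"] = np.concatenate([layer["K"], k], axis=1)
        layer["V"] = np.concatenate([layer["V"], v], axis=1)

# One new token arrives: every "floor" files a copy.
new = [(np.zeros((n_heads, 1, d_head), np.float16),) * 2 for _ in range(n_layers)]
append_token(cache, new)
print(len(cache), cache[0]["K"].shape)  # 32 (32, 1, 128)
```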
Where does the KV cache live? Not in the model itself. The trained model weights don't include any cache — they're just the fixed W_Q, W_K, W_V matrices. The KV cache is created and managed by the serving framework at inference time: vLLM, TensorRT-LLM, HuggingFace generate(), or llama.cpp. Karpathy's nanoGPT skips caching for simplicity, but every production system implements it.
Use the right panel to play through the side-by-side comparison — watch how the "without cache" side grows as a triangle while the "with cache" side stays flat at 1 computation per step.
Prefill vs Decode
When you send a message to ChatGPT, two very different things happen behind the scenes:
Prefill is like a teacher reading an entire essay at once — the GPU processes all your prompt tokens in parallel, computes K/V for every token, and fills the cache in one shot. This is fast because GPUs are built for parallel work.
Decode is like writing a reply one word at a time — the model generates each output token by looking back at the entire KV cache (everything seen so far), producing one token, then repeating. This is slower because the GPU spends most of its time loading the cache from memory, not computing.
TTFT vs TPS
Time to first token (TTFT) measures how long prefill takes. A short prompt prefills quickly; a 10,000-token document takes roughly 100× longer than a 100-token prompt, because the model must process every input token before it can output anything.
Tokens per second (TPS) measures decode throughput — how fast the model generates after the first token. Longer context doesn't slow decode dramatically, but larger cache means more memory bandwidth consumed per step.
The two metrics are largely independent. A model can have excellent TPS but poor TTFT for long prompts, or vice versa.
When ChatGPT pauses briefly before the first word appears after you paste a long document — that pause is prefill. The model is processing your entire input and filling the KV cache before it generates even the first token. Longer paste, longer pause.
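The two phases differ mainly in batching, which a minimal NumPy sketch makes visible (single layer, single head, toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_K, W_V = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Prefill: all prompt tokens at once — one batched matmul fills the cache.
prompt = rng.standard_normal((1000, d))   # a 1000-token prompt
K_cache = prompt @ W_K                    # (1000, d) computed in one shot
V_cache = prompt @ W_V

# Decode: one token per step — each step appends a single row to the cache.
for _ in range(3):
    x_new = rng.standard_normal((1, d))
    K_cache = np.concatenate([K_cache, x_new @ W_K])
    V_cache = np.concatenate([V_cache, x_new @ W_V])

print(K_cache.shape)  # (1003, 8)
```

Prefill is one large, GPU-friendly matrix multiply; decode is a sequence of tiny ones, each of which must first read the whole growing cache.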
The simulator on the right panel shows the growing KV cache — this is the cache being read during decode, built during prefill.
Memory Cost
The KV cache trades computation for memory. But how much memory exactly? Let's build up the formula piece by piece for Llama-2 7B:
Each term corresponds to a real part of the model:
- × 2 (K and V) — we store two vectors per token: one Key and one Value
- × 32 (layers) — each transformer layer has its own cache (the filing-cabinet building from step 2, here with 32 floors)
- × 32 (heads) — recall from the Attention module: each layer runs multiple attention heads in parallel, each looking for different patterns (syntax, semantics, etc.). Each head stores its own separate K/V pair.
- × 128 (head size) — each K or V vector has 128 numbers. The model's full dimension (4096) is split evenly across heads: 4096 ÷ 32 heads = 128 per head
- × 2 (bytes per number) — each number is stored in FP16 (half precision), which takes 2 bytes of memory
Multiply them all: 2 × 32 × 32 × 128 × 2 = 524,288 bytes ≈ 512 KB per token.
That's half a megabyte for every single token in the conversation.
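The formula drops straight into code (parameterized by the K/V-head count, so the same function also covers GQA models, where it is smaller than the query-head count):

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_number=2):
    """2 (K and V) x layers x K/V heads x head size x bytes per number."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_number

llama2_7b = kv_bytes_per_token(32, 32, 128)   # FP16, standard MHA
print(llama2_7b)                              # 524288 bytes ≈ 512 KB
print(llama2_7b * 4096 / 1e9)                 # ≈ 2.1 GB at 4K context

gpt3 = kv_bytes_per_token(96, 96, 128)
print(gpt3 / 2**20)                           # 4.5 MiB (≈ 4.7 MB) per token
```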
What this means in practice
- Llama-2 7B at 4,096 tokens: 512 KB × 4,096 = ~2.1 GB per request
- GPT-3 175B (96 layers, 96 heads): ~4.7 MB/token → 4,096 tokens = ~19 GB per request — only ~5% of its ~350 GB FP16 weight footprint at this context, but the ratio climbs fast as context and concurrency grow (see below)
- 10 concurrent users on Llama-2 7B: 2.1 GB × 10 = 21 GB just for KV cache, more than the 14 GB FP16 weights of the model itself
Context Window = Memory
Doubling the context window doubles the KV cache; quadrupling it quadruples it. Cache memory grows linearly with sequence length, unlike attention compute, which grows quadratically. (Flash Attention reduces attention's memory traffic, not its quadratic FLOP count.)
- 4K context → 2.1 GB (Llama-2 7B)
- 32K context → 16.8 GB — already larger than the 14 GB FP16 weights
- 128K context → 67 GB on one user's request — nearly 5× the 14 GB FP16 weight footprint of Llama-2 7B itself. Long context flips the cost ratio: cache, not weights, dominates HBM
This is why 128K context models need massive GPU clusters: the memory isn't for compute, it's for the cache.
KV cache is the hidden cost of long context. The formula is simple — 2 × layers × heads × d_head × bytes — but the numbers compound fast. 128K tokens × bytes_per_token × concurrent users = total GPU memory needed. This is why GQA, cache quantization, and paged attention exist.
The right panel shows the cache growing token by token — each block is one more token's K/V pair stored in memory.
GQA: Shrinking the Cache
From the previous step, we know the KV cache stores a K/V pair per head. More heads = more cache memory. Can we reduce the number of K/V pairs without losing quality?
The key idea: share K/V across query heads
In standard attention, every Query head (Q) has its own dedicated Key and Value heads (K/V). But what if multiple Q heads shared the same K/V?
In the diagram above, pink squares are Query heads (top row) and green squares are K/V heads (bottom row). The lines show which Q heads read from which K/V:
MHA (standard): Each Q head gets its own K/V — one-to-one. 8 Q heads = 8 K/V pairs. Full expressiveness, but the cache is large.
MQA: All Q heads share one K/V pair. The cache shrinks 8× — but all heads see the exact same Key/Value information. It's like 8 students all sharing one textbook — they all read the same page, so they all think the same way. The model loses the ability to look at the input from different angles, and quality drops.
GQA (modern): The fix for MQA's problem. Instead of forcing all heads to share one K/V, split them into small groups. Each group of 4 Q heads shares one K/V pair. Now there are 2 different K/V pairs — so the model still has 2 different perspectives on the input, not just 1. It's like splitting 8 students into 2 study groups of 4, each group with its own textbook. Each group still shares, but different groups read different material — preserving diversity while using far fewer books than giving one to each student.
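The grouping can be sketched in a few lines of NumPy (toy sizes: 8 query heads, 2 K/V heads, so groups of 4; at attention time each stored K/V head is simply repeated across its group):

```python
import numpy as np

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 4, 3
group = n_q_heads // n_kv_heads           # 4 query heads share each K/V head

# Only n_kv_heads K/V heads are ever stored in the cache.
K = np.arange(n_kv_heads * seq * d_head, dtype=float).reshape(n_kv_heads, seq, d_head)

# Expand on the fly so each of the 8 Q heads has a K to attend against.
K_expanded = np.repeat(K, group, axis=0)            # (8, 3, 4)
assert K_expanded.shape == (n_q_heads, seq, d_head)
assert np.array_equal(K_expanded[0], K_expanded[3])  # Q heads 0-3 share K/V head 0
assert np.array_equal(K_expanded[4], K_expanded[7])  # Q heads 4-7 share K/V head 1

print(f"cache reduction: {n_q_heads / n_kv_heads:.0f}x")  # 4x
```

Note the cache shrinks by the ratio of query heads to K/V heads: 4× in this toy setup, 8× for MQA's single shared head.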
In Practice
- Llama-2 70B uses GQA with 8 query heads per K/V head
- Mistral 7B uses GQA from the start
- GPT-3 uses standard MHA — predates GQA adoption
An 8× cache reduction means 8× more concurrent users on the same hardware, or 8× longer context at the same memory budget. This is why GQA has been adopted almost universally in large models released after 2023.
GQA is why modern large models can serve long contexts at all. Without it, the KV cache for a 70B model at 128K context would exceed all practical GPU memory limits. Group size is a hardware-aware hyperparameter — tuned to balance cache size against attention quality.
The right panel shows the standard side-by-side cache — imagine the "with cache" blocks compressed 8× for a GQA model.
KV Cache in Production
The KV cache solves the computation problem but creates a memory management problem. Production serving systems spend enormous engineering effort on this cache.
Batch Size vs Context Length
GPU memory is finite. More cache per request = fewer concurrent requests. Both grids below use the same total GPU memory:
Serving providers must choose: many short conversations or few long conversations on the same hardware. This is the central serving tradeoff — throughput is maximized by batching, but batch size is constrained by cache memory per request.
KV Cache Quantization
Model weights are often quantized (FP16 → INT8). The KV cache can be quantized independently — shrinking the bytes per number stored in the cache:
KV quantization is one of the highest-leverage memory reductions because for long-context requests, the cache can be larger than the model weights themselves.
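A minimal sketch of per-vector symmetric INT8 quantization of one cached vector (one of several schemes; real systems typically quantize per-head or per-channel with calibrated scales):

```python
import numpy as np

rng = np.random.default_rng(0)
kv_fp16 = rng.standard_normal((1, 128)).astype(np.float16)  # one cached K vector

# Symmetric INT8: store int8 values plus one floating-point scale.
scale = np.abs(kv_fp16).max() / 127.0
kv_int8 = np.round(kv_fp16 / scale).astype(np.int8)  # 1 byte/number, was 2

kv_restored = kv_int8.astype(np.float32) * scale     # dequantize when read
err = np.abs(kv_restored - kv_fp16.astype(np.float32)).max()
assert err < scale                                   # bounded rounding error
print(kv_int8.nbytes, kv_fp16.nbytes)  # 128 256 — cache halved
```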
Eviction Policies
When cache memory fills up, something must be dropped. Toggle between strategies to see which tokens survive:
Each strategy has tradeoffs — "drop oldest" is simple but loses early context, "drop least-attended" is smarter but requires tracking attention, and "sliding window" (used in Mistral) is the most efficient but strictly limits how far back the model can look.
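The sliding-window strategy is the simplest to sketch: Python's `deque` with a fixed `maxlen` implements drop-oldest eviction directly (toy window size for illustration; Mistral's window is 4,096 tokens):

```python
from collections import deque

WINDOW = 4  # tiny window for illustration

# deque(maxlen=...) silently evicts the oldest entry when full.
k_cache = deque(maxlen=WINDOW)
for token_pos in range(10):      # generate 10 tokens
    k_cache.append(f"K[{token_pos}]")

print(list(k_cache))             # ['K[6]', 'K[7]', 'K[8]', 'K[9]']
```

Tokens 0 through 5 are gone: memory stays bounded, but the model can no longer attend to them.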
Memory-Bandwidth vs Compute
During decode, the GPU does two things: load the KV cache from memory and compute the attention math. Which one is the bottleneck depends on batch size:
Toy roofline calibrated to A100-class (HBM ~2 TB/s, ridge ≈ 150 FLOPs/byte). Other GPUs have different ridges.
Decode rule: arithmetic intensity (AI) scales with batch size, because model weights are loaded once per step and amortized across the batch. Sliding the sequence length does not move the gauges in this toy meter; batching is the dominant lever for decode throughput here.
With few requests, the GPU finishes computing quickly but spends most time waiting for cache data to arrive from memory. With many requests, the loading cost is shared — the GPU stays busy computing instead of waiting. This is why batching improves throughput.
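The amortization argument can be made concrete with a toy cost model (the byte and FLOP counts below are assumed round numbers for Llama-2 7B at 4K context, not measurements):

```python
# Toy decode-step cost model: per step, the GPU loads the weights once plus
# each request's KV cache, and does ~2 FLOPs per parameter per request.
WEIGHT_BYTES = 14e9       # Llama-2 7B weights in FP16
KV_BYTES = 2.1e9          # per-request KV cache at 4K context
FLOPS_PER_REQ = 2 * 7e9   # ~2 FLOPs per parameter per generated token

def arithmetic_intensity(batch: int) -> float:
    flops = FLOPS_PER_REQ * batch
    bytes_moved = WEIGHT_BYTES + KV_BYTES * batch  # weights amortized over batch
    return flops / bytes_moved

print(f"batch=1:  {arithmetic_intensity(1):.1f} FLOPs/byte")   # deeply memory-bound
print(f"batch=32: {arithmetic_intensity(32):.1f} FLOPs/byte")  # better, still below the ridge
```

Even at batch 32 the intensity stays well under an A100-class ridge of ~150 FLOPs/byte, which is why decode remains memory-bandwidth-bound and why every byte shaved off the cache matters.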
Flash Attention
To understand Flash Attention, you need to know that a GPU has two types of memory:
- HBM (High Bandwidth Memory) — the GPU's main memory (e.g., 80 GB on an A100). Large but relatively slow to access. This is where the KV cache lives.
- SRAM — a tiny, ultra-fast scratchpad inside the GPU's compute cores (e.g., 20 MB on an A100). Data here can be read instantly, but there's very little of it.
The problem: standard attention loads the entire KV cache from HBM into SRAM, computes the scores, writes them back to HBM, loads them again for softmax, writes again... lots of back-and-forth between slow HBM and fast SRAM.
Flash Attention fixes this by computing attention in small tiles:
Instead of loading the entire cache at once (left), Flash Attention processes one tile at a time (right). Click the KV blocks on the right side to see: only one piece moves from HBM to SRAM at a time, gets computed, then the next piece takes its place. Each piece makes one round trip instead of multiple.
The result: 2–5× faster attention, much less memory for intermediate results. The KV cache still exists — Flash Attention just reads it more efficiently, like reading a book one chapter at a time instead of photocopying the entire book first.
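The tiling idea can be sketched with the online-softmax recurrence Flash Attention builds on: visit each K/V tile once, keep a running max and normalizer, and the result matches ordinary attention exactly (single query, no actual I/O modeling; the real kernel also tiles queries and runs on-chip):

```python
import numpy as np

def attention_full(q, K, V):
    """Reference: materialize all scores at once."""
    s = K @ q / np.sqrt(q.size)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def attention_tiled(q, K, V, tile=4):
    """One pass over K/V in tiles, with running max m and normalizer l."""
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, len(K), tile):              # one tile HBM->SRAM at a time
        s = K[i:i+tile] @ q / np.sqrt(q.size)     # scores for this tile only
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[i:i+tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K, V = rng.standard_normal((10, 8)), rng.standard_normal((10, 8))
assert np.allclose(attention_full(q, K, V), attention_tiled(q, K, V))
```

No full score matrix is ever materialized: only one tile's scores exist at a time, which is exactly the memory saving the chapter-at-a-time analogy describes.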
KV cache is the central bottleneck of LLM serving. GQA, cache quantization, paged attention, sliding window attention — every major inference optimization is ultimately about managing this cache more efficiently. Understanding it is the foundation for understanding all of LLM infrastructure.