PagedAttention — Virtual Memory for LLM Serving
What is PagedAttention?
Traditional KV cache allocation pre-reserves a contiguous block of GPU memory for each request's maximum possible sequence length. This causes internal fragmentation (wasted space within allocations) and external fragmentation (unusable gaps between allocations), wasting 60–80% of KV cache memory. PagedAttention, introduced by vLLM, borrows the virtual memory concept from operating systems: instead of one large contiguous allocation, KV cache is stored in small fixed-size blocks mapped through a block table. This eliminates fragmentation, enables copy-on-write for parallel sampling, and allows prefix sharing across requests — increasing serving throughput by 2–4×.
The Fragmentation Problem
Recall from the KV Cache module: every token in your prompt and every token the model generates gets its keys and values stored in GPU memory — the KV cache. That cache is what lets generation be fast.
But there's a serious problem with how that memory is managed.
You Don't Know the Response Length
When a request arrives, you have no idea if the response will be 10 tokens or 10,000. The naive solution is to pre-allocate the maximum possible length upfront — reserve a contiguous block of memory large enough to hold the longest possible response.
A request that generates 100 tokens out of a 512-token reservation wastes 412 tokens' worth of GPU memory. That memory is locked — no other request can use it — until the sequence finishes.
You must reserve for the worst case.
What This Costs You
In the right panel, the Contiguous side shows pre-allocated blocks for each sequence. Step through the simulation and watch what happens:
- Colored blocks are actively used KV cache
- The faint blocks marked ! are pre-allocated but empty — wasted GPU memory that no other request can touch
- The red fragmentation percentage shows how much of the GPU's total memory is being wasted right now
Watch the ! blocks accumulate as sequences arrive — that's memory locked but unused. The Paged side, by contrast, allocates only what's needed.
How Bad Is It?
The vLLM authors measured existing production systems and found they wasted 60–80% of GPU memory due to this fragmentation. That's not a minor inefficiency — it means only 20–40% of your GPU memory is actually serving requests.
This directly caps your throughput. More wasted memory = fewer requests in flight = lower utilization = more GPUs needed to hit the same QPS.
For serving throughput, fragmentation — not compute, not model size — is the dominant bottleneck. Most of your expensive GPU memory sits idle, locked by requests that haven't used it yet.
The Root Cause
The problem isn't the KV cache itself — it's how memory is allocated: one big contiguous reservation per request, sized for the worst case. What we need is a way to allocate memory in small pieces, on demand, only as tokens are actually generated.
That's exactly what PagedAttention does. In the next step, you'll see the idea it borrowed from operating systems.
Virtual Memory for GPUs
PagedAttention borrowed its core idea from an old trick that every modern operating system uses. Let's understand that trick first, then see how it applies to KV cache.
How OS Virtual Memory Works
It's the same trick that makes your laptop work.
When you open Chrome, VS Code, Slack, and Spotify simultaneously, each program needs memory. The obvious approach: give each program a big contiguous chunk of RAM. But what if Chrome needs 2 GB, VS Code needs 1 GB, and you only have 8 GB total? If you pre-allocate generously, you run out fast.
The OS solves this with virtual memory. Think of RAM as a pegboard with numbered slots. Each program thinks it has been handed slots 1–16 in a row; really the OS gives it whichever slots happen to be free — 3, 17, 42, 9, … — and keeps a map from "what the program thinks it has" to "where it actually lives." That map is the page table. Once you see it this way, the formal terms below are just names for this pattern:
- Pages — RAM is divided into small fixed-size chunks, typically 4 KB each. Not one big block per program — thousands of tiny pages.
- Virtual addresses — each program thinks it has its own contiguous memory, starting at address 0. From Chrome's perspective, its memory is one continuous block.
- Page table — the OS maintains a lookup table that translates each program's virtual addresses to actual physical RAM locations. When Chrome reads "address 0x1000," the OS translates "virtual page 4 → physical page 17" and returns the right data. Chrome never sees the translation.
The pages are scattered across physical RAM — Chrome's page 4 might be in physical slot 17, its page 5 in physical slot 3. The program doesn't know or care.
This is why your laptop can run 50 apps simultaneously even though no single app owns all the RAM — each gets exactly the pages it needs, allocated from a shared pool, on demand.
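To make the translation concrete, here is a toy sketch in Python. The dictionary and function are illustrative stand-ins; a real OS keeps page tables in hardware-walked structures and the CPU's MMU does the lookup, but the arithmetic is the same.

```python
PAGE_SIZE = 4096  # 4 KB pages, the typical size

# Hypothetical page table for one program: virtual page number -> physical page number
page_table = {0: 17, 1: 3, 2: 42, 3: 9}

def translate(virtual_address: int) -> int:
    """Translate a program's virtual address into a physical RAM address."""
    virtual_page = virtual_address // PAGE_SIZE   # which page the address falls in
    offset = virtual_address % PAGE_SIZE          # position inside that page
    physical_page = page_table[virtual_page]      # the lookup the page table provides
    return physical_page * PAGE_SIZE + offset

# The program asks for address 4096 (the start of its virtual page 1);
# that byte actually lives at the start of physical page 3.
print(translate(4096))  # -> 12288
```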
The Same Idea for KV Cache
PagedAttention applies this exact pattern to GPU memory:
- Each sequence thinks its KV cache is contiguous (the virtual view)
- Physically, the KV cache is stored in small blocks scattered across GPU memory
- A block table maps virtual block indices to physical block locations
The attention computation runs on the virtual view — the sequence behaves as if its cache is contiguous. The block table translation happens transparently underneath.
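As a rough illustration (toy shapes and invented names, not vLLM's actual CUDA kernel), here is how a sequence's logically contiguous keys can be gathered from scattered physical blocks by following its block table:

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per block
HEAD_DIM = 8      # toy head dimension

# Physical KV storage: a shared pool of blocks standing in for GPU memory
num_physical_blocks = 64
k_pool = np.random.randn(num_physical_blocks, BLOCK_SIZE, HEAD_DIM)

# One sequence's block table: virtual block i -> physical block index
block_table = [17, 3, 42]   # 3 blocks, enough for up to 48 tokens
seq_len = 40                # tokens actually generated so far

def gather_keys(block_table, seq_len):
    """Return this sequence's keys as one contiguous array (the virtual view)."""
    blocks = [k_pool[b] for b in block_table]   # follow the block table
    keys = np.concatenate(blocks, axis=0)       # stitch the blocks together
    return keys[:seq_len]                       # drop unused slots in the last block

print(gather_keys(block_table, seq_len).shape)  # (40, 8)
```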
What This Looks Like
Look at the Paged side in the right panel. Compare it to the Contiguous side:
- Physical blocks are allocated tightly with no gaps — no ! waste blocks
- The Virtual → Physical mapping table below the grid shows each sequence's virtual blocks pointing to their actual physical locations
- Blocks appear only when tokens are actually generated — no pre-allocation
Focus on the block table: each sequence's virtual blocks map to scattered physical locations, but the sequence doesn't know or care. That's the same illusion the OS provides to programs.
The same trick that lets your laptop run 50 apps at once now lets a GPU serve 50 LLM requests simultaneously. Virtual memory for KV cache — an idea from the 1960s solving a 2023 problem.
In the next step, we'll look at how the block manager actually allocates and frees these blocks — and why the block size matters.
Block Tables & Allocation
Now let's look at how PagedAttention actually manages those blocks — the mechanics that turn the virtual memory idea into a working system.
Fixed-Size Blocks
Each physical block holds a fixed number of tokens' worth of KV cache — typically 16 tokens. So a 1024-token sequence uses 64 blocks. A 100-token sequence uses 7 blocks (6 full + 1 partially filled).
The block size is fixed at system startup. Every block is the same size, which makes the free list simple and allocation fast.
The Block Manager
The serving system maintains a free list — a pool of available physical blocks. When a sequence needs another block (its latest token won't fit in its current last block), the block manager:
- Pops one block from the free list
- Adds an entry to that sequence's block table: virtual block N → physical block M
- The attention kernel reads through the block table transparently
When a sequence finishes generating, all its blocks are returned to the free list. They're immediately available for new requests — no fragmentation, no wasted reservation.
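Here is a minimal sketch of that bookkeeping in Python. The class and method names are invented for illustration and heavily simplified compared to a real serving engine:

```python
class BlockManager:
    """Toy block manager: a free list plus one block table per sequence."""

    def __init__(self, num_physical_blocks: int):
        self.free_list = list(range(num_physical_blocks))  # every block starts free
        self.block_tables = {}                             # seq_id -> [physical block ids]

    def allocate_block(self, seq_id: str) -> int:
        """Pop a free physical block and append it to the sequence's block table."""
        if not self.free_list:
            raise MemoryError("no free blocks: caller must preempt or swap")
        physical_block = self.free_list.pop()
        self.block_tables.setdefault(seq_id, []).append(physical_block)
        return physical_block

    def free_sequence(self, seq_id: str) -> None:
        """Sequence finished: return all of its blocks to the pool immediately."""
        self.free_list.extend(self.block_tables.pop(seq_id, []))

# A sequence grabs blocks as it generates, then releases them when it completes.
mgr = BlockManager(num_physical_blocks=8)
mgr.allocate_block("req-1")
mgr.allocate_block("req-1")
print(len(mgr.free_list))   # 6
mgr.free_sequence("req-1")
print(len(mgr.free_list))   # 8
```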
Waste Only in the Last Block
Compare this to the contiguous approach:
- Contiguous: pre-allocate max length → 60–80% waste
- Paged: allocate on demand → waste only in the last block per sequence
The last block of each sequence may be partially filled — a 100-token sequence fills 6 complete blocks and then uses 4 of 16 slots in block 7. That's 12 wasted slots, which is less than one block per sequence.
In practice, with many concurrent sequences, total waste is under 4%.
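A quick back-of-the-envelope comparison, using made-up request lengths, shows where those numbers come from:

```python
import math

BLOCK_SIZE = 16
MAX_SEQ_LEN = 2048                      # worst case each contiguous reservation must cover
actual_lengths = [100, 730, 45, 1500]   # tokens actually produced by four requests

# Contiguous: every request reserves MAX_SEQ_LEN up front
contiguous_reserved = MAX_SEQ_LEN * len(actual_lengths)
contiguous_waste = contiguous_reserved - sum(actual_lengths)

# Paged: each request holds ceil(length / BLOCK_SIZE) blocks; the only waste
# is the unused slots in each sequence's final block
paged_reserved = sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in actual_lengths)
paged_waste = paged_reserved - sum(actual_lengths)

print(f"contiguous waste: {contiguous_waste / contiguous_reserved:.0%}")  # ~71%
print(f"paged waste:      {paged_waste / paged_reserved:.0%}")            # ~1%
```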
Block Size Tradeoff
Why 16 tokens? It's a tradeoff:
- Smaller blocks (e.g., 4 tokens): less waste per sequence, but more block table entries and more overhead during attention
- Larger blocks (e.g., 64 tokens): less overhead, but more waste in the last block and less flexibility
16 tokens is the empirical sweet spot for most workloads — small enough that waste stays minimal, large enough that the block table stays manageable.
From 60–80% waste down to under 4%. That freed memory directly translates to 2–4× more concurrent requests on the same GPU — same hardware, dramatically higher throughput.
Step through the simulation in the right panel. Watch the free list counter drop as sequences allocate blocks, then jump back up when a sequence ends and returns its blocks to the pool. Notice that paged waste stays near ~3–4% — only the last block per sequence has unused slots.
Copy-on-Write & Prefix Sharing
The block table abstraction unlocks two more powerful capabilities: sharing blocks across sequences, and reusing blocks across requests.
Beam Search: Sharing a Prefix
Beam search is a decoding strategy where the model generates N candidate continuations in parallel, scores them, keeps the best ones, and extends further. The name "beam" comes from the idea of a flashlight beam sweeping through a search tree — you only illuminate the N most promising paths at each step. (The technique originated in speech recognition in the 1970s.) It's used when you need high-quality outputs (translation, summarization).
All candidates start with the same prompt — the shared prefix. As they diverge, each beam generates different tokens.
The memory problem: without PagedAttention, each candidate gets its own full copy of the KV cache from the start. 8 beams = 8× the memory for the same prefix data.
With PagedAttention, all candidates can share the same physical blocks for the shared prefix. Only when candidates diverge — generating different tokens — does a new block get allocated for each divergent branch.
This is copy-on-write, the same pattern your OS uses when a process forks a child:
- Both parent and child initially point to the same physical memory pages
- When either one writes to a page, the OS makes a private copy for that process
- Until the write happens, no duplication occurs
In beam search: when two candidates generate different tokens at step K, that's the write event. Only block K gets duplicated. All earlier blocks remain shared.
The right panel shows the impact: "Without CoW" duplicates all prefix blocks per beam. "With CoW" shares the prefix — only divergent blocks are unique. Compare the block counts.
The result: up to 55% memory savings for beam search, enabling 2.2× throughput on beam-heavy workloads.
Prefix Caching: Same System Prompt
Most production deployments use a system prompt — "You are a helpful assistant..." or a multi-paragraph context for a RAG application. Every user request gets this same prefix.
Without prefix caching: each request computes KV for the full system prompt from scratch. 1000 requests = 1000× the same computation and 1000× the memory.
With prefix caching: the KV blocks for the system prompt are computed once and stored. Every new request that shares that prefix just points its block table at those same physical blocks. No recomputation, no extra memory.
This is what "prompt caching" means when you see it in OpenAI or Anthropic API pricing — the system prompt KV is computed once and cached for subsequent requests with the same prefix.
"But Every User Sends a Different Prompt?"
You might wonder: if every user types something different, how does prefix sharing help? The key is that in production, every request looks like this:
[System prompt: 500–2000 tokens, identical] + [User message: varies]
The system prompt — "You are a helpful assistant...", tool-use instructions, or a multi-page RAG context — is the same across all requests to that deployment. 1000 concurrent users means 1000 different user messages, but the same 1000-token system prompt prefix repeated 1000 times. Without prefix caching, that's 1000× the KV memory for identical data. With it, it's stored once.
Copy-on-write is even simpler: beam search by definition starts all N candidates from the same sequence, so the shared prefix is always 100% of the input.
1000 requests with the same system prompt don't cost 1000× the memory. The blocks are shared — computed once, reused for every request. This is why prompt caching is such a significant cost reduction at scale.
Why Block Sharing Works
Block sharing is only possible because of the virtual→physical indirection. Multiple sequences can have different virtual block tables that point to the same physical block. The attention kernel reads through the block table without knowing or caring that two sequences point to the same physical location.
When a sequence needs to write to a shared block (because it's diverging), the block manager sees that the block's reference count is greater than 1, allocates a new physical block, copies the contents, and updates that sequence's block table. The old block's reference count drops by 1, and it stays shared by the remaining sequences.
Pure indirection enabling pure sharing — the same reason OS virtual memory enables efficient process forking.
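In code the mechanism is small. This is a simplified sketch with invented names, not the real block manager:

```python
class Block:
    """A physical KV block plus the reference count used for copy-on-write."""
    def __init__(self, data):
        self.data = list(data)
        self.ref_count = 1

def share(block):
    """Another sequence points at the same physical block: bump the count, copy nothing."""
    block.ref_count += 1
    return block

def write(block, slot, value):
    """Write into a block, copying it first if another sequence still shares it."""
    if block.ref_count > 1:              # the write event: divergence detected
        block.ref_count -= 1             # original stays with the other sequence(s)
        block = Block(block.data)        # private copy for the writer (ref_count = 1)
    block.data[slot] = value
    return block

prompt_block = Block([0] * 16)           # shared prompt KV block
beam_a = share(prompt_block)             # beam A and beam B share it physically
beam_a = write(beam_a, 0, 111)           # beam A diverges and gets its own copy
print(prompt_block.data[0], beam_a.data[0])   # 0 111  (beam B's view is untouched)
```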
PagedAttention in Practice
PagedAttention was published in 2023 by the team at UC Berkeley. Its first implementation — vLLM — became the reference LLM serving system almost immediately.
The Numbers
The impact was stark:
- 24× throughput vs Hugging Face Transformers (the standard research inference library)
- 3.5× throughput vs TGI (Hugging Face's production serving system at the time)
The improvement comes entirely from memory utilization. vLLM can fit 3–4× more concurrent requests on the same GPU because it wastes almost no memory on fragmentation. More concurrent requests = higher GPU utilization = higher throughput.
Block Swapping
What happens when GPU memory fills up entirely — even with paged allocation?
vLLM supports block swapping: when all physical blocks are occupied and a new request needs memory, the system can swap a preempted sequence's KV blocks from GPU memory to CPU RAM. When GPU blocks free up again, the blocks are swapped back and that sequence resumes.
This lets vLLM serve requests longer than GPU memory alone would allow. The swap adds latency — CPU↔GPU transfer is slow — so it's a last resort, not the normal path. But it means the system degrades gracefully under load rather than failing.
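A rough sketch of the two moves, with invented names and plain Python dicts standing in for GPU and CPU memory:

```python
def swap_out(seq_id, block_tables, gpu_blocks, cpu_blocks, free_list):
    """Preempt a sequence: copy its KV blocks to CPU RAM and free the GPU blocks."""
    for phys in block_tables[seq_id]:
        cpu_blocks[(seq_id, phys)] = gpu_blocks[phys]   # slow GPU-to-CPU transfer in reality
        free_list.append(phys)                          # GPU block is immediately reusable

def swap_in(seq_id, block_tables, gpu_blocks, cpu_blocks, free_list):
    """Resume a sequence: copy its blocks back into whatever GPU blocks are now free."""
    new_table = []
    for old_phys in block_tables[seq_id]:
        phys = free_list.pop()
        gpu_blocks[phys] = cpu_blocks.pop((seq_id, old_phys))
        new_table.append(phys)
    block_tables[seq_id] = new_table                    # table now points at the new locations

gpu, cpu, free = {0: "kv-0", 1: "kv-1"}, {}, []
tables = {"req-7": [0, 1]}
swap_out("req-7", tables, gpu, cpu, free)   # free == [0, 1]; blocks parked in CPU RAM
swap_in("req-7", tables, gpu, cpu, free)    # blocks copied back, table remapped
```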
Industry Adoption
PagedAttention's ideas are now the standard in LLM serving:
- SGLang (Stanford) uses paged KV management for structured generation
- TensorRT-LLM (NVIDIA) adopted paged KV cache for production deployments
- TGI (Hugging Face) rewrote their memory management to use paged allocation
- Most cloud LLM APIs run on vLLM or a system that copied its approach
LMSYS Chatbot Arena — one of the largest public LLM evaluation platforms — runs on vLLM and handles 30,000–60,000 requests per day with approximately 50% fewer GPUs than a comparable system without PagedAttention.
The Broader Pattern
PagedAttention solved the KV cache memory problem at exactly the right moment — as LLMs went from research curiosity to production workload. The throughput gains translated directly into lower serving costs, making many applications economically viable that weren't before.
PagedAttention (2023) solved the serving memory problem the same way OS virtual memory solved the RAM problem in the 1960s. Good ideas recur — and when they do, the impact compounds.
The right panel shows the key performance numbers from the vLLM paper, plus a block swap animation — click the buttons to see how GPU blocks move to CPU RAM and back when memory pressure hits.
In the next step, we connect PagedAttention to every other module in the curriculum. The full picture, end to end.
The Full Serving Stack
This is the final module. You've now learned every major component of the modern LLM stack. Let's connect them into one complete picture — the exact sequence of events that happens every time you send a message to an LLM.
The right panel shows the complete 9-module pipeline — every module you've learned, connected. Module 9 (Paged Attention) is highlighted as the final piece.
The Request Lifecycle
When you type a message and hit send, every module you've learned runs, in order.
Every Module, One Sentence
Module 1 — Tokenization: your text is split into tokens and converted to integer IDs, because neural networks work with numbers, not characters.
Module 2 — Embeddings: each token ID is looked up in a learned table to produce a dense vector; words with similar meanings land near each other in this space.
Module 3 — Attention: tokens look at each other through Q/K/V projections — query asks "what do I need?", key answers "what do I have?", value carries the information — across multiple heads in parallel, with a causal mask so tokens can't see the future.
Module 4 — Transformer Block: each block runs attention + a feed-forward network + residual connections, then normalizes — this block repeats N times (32–96 layers in large models), each refining the representation.
Module 5 — Generation: the final layer produces logits over the vocabulary; temperature scales them; top-K/top-P narrows the candidates; one token is sampled — then the whole forward pass runs again for the next token.
Module 6 — KV Cache: instead of recomputing keys and values for the prompt on every decode step, store them after prefill; decode only needs to compute K/V for the single new token.
Module 7 — Quantization: model weights are compressed from FP16 to INT4 (4× smaller), and KV cache can be stored in FP8 (2× smaller) — same behavior, fraction of the memory.
Module 8 — Batching: continuous batching schedules multiple requests into a single GPU pass — new requests join mid-batch as slots free, eliminating the GPU idle time of static batching.
Module 9 — Paged Attention: KV cache blocks are allocated on demand from a free pool via a virtual→physical block table, reducing memory waste from 60–80% to under 4% and enabling prefix sharing across requests.
The Full System, Together
These aren't independent modules — they interact at every level:
- Quantization reduces weight size, leaving more GPU memory for KV cache
- KV Cache requires that memory to be managed efficiently — PagedAttention provides that management
- Paged blocks enable continuous batching to work well — freed blocks go back to the pool immediately
- Prefix caching (PagedAttention) means shared system prompts don't multiply memory cost with batch size
- The attention kernel (FlashAttention) reads through paged blocks with minimal overhead
A request entering a production system passes through every one of these systems in sequence, every decode step.
You now understand the entire pipeline — from the text you type to the token that appears on screen. These concepts are not theoretical. They are running right now, in data centers worldwide, serving billions of requests every day.