PagedAttention — Virtual Memory for LLM Serving
What is PagedAttention?
Traditional KV cache allocation pre-reserves a contiguous block of GPU memory for each request's maximum possible sequence length. This causes internal fragmentation (wasted space within allocations) and external fragmentation (unusable gaps between allocations), wasting 60–80% of KV cache memory. PagedAttention, introduced by vLLM, borrows the virtual memory concept from operating systems: instead of one large contiguous allocation, KV cache is stored in small fixed-size blocks mapped through a block table. This eliminates fragmentation, enables copy-on-write for parallel sampling, and allows prefix sharing across requests — increasing serving throughput by 2–4×.
The Fragmentation Problem
Recall from the KV Cache module: every token in your prompt and every token the model generates gets its keys and values stored in GPU memory — the KV cache. That cache is what lets generation be fast.
But there's a serious problem with how that memory is managed.
You Don't Know the Response Length
When a request arrives, you have no idea if the response will be 10 tokens or 10,000. The naive solution is to pre-allocate the maximum possible length upfront — reserve a contiguous block of memory large enough to hold the longest possible response.
A request that generates 100 tokens out of a 512-token reservation wastes 412 tokens' worth of GPU memory. That memory is locked — no other request can use it — until the sequence finishes.
You must reserve for the worst case.
What This Costs You
In the right panel, the Contiguous side shows pre-allocated blocks for each sequence. Step through the simulation and watch what happens:
- Colored blocks are actively used KV cache
- The faint blocks marked ! are pre-allocated but empty — wasted GPU memory that no other request can touch
- The red fragmentation percentage shows how much of the GPU's total memory is being wasted right now
Watch the ! blocks accumulate as sequences arrive — that's memory locked but unused. The Paged side, by contrast, allocates only what's needed.
How Bad Is It?
The vLLM authors measured existing production systems and found they wasted 60–80% of GPU memory due to this fragmentation. That's not a minor inefficiency — it means only 20–40% of your GPU memory is actually serving requests.
This directly caps your throughput. More wasted memory = fewer requests in flight = lower utilization = more GPUs needed to hit the same QPS.
For serving throughput, fragmentation — not compute, not model size — is the dominant bottleneck. Most of your expensive GPU memory sits idle, locked by requests that haven't used it yet.
The Root Cause
The problem isn't the KV cache itself — it's how memory is allocated: one big contiguous reservation per request, sized for the worst case. What we need is a way to allocate memory in small pieces, on demand, only as tokens are actually generated.
That's exactly what PagedAttention does. In the next step, you'll see the idea it borrowed from operating systems.
Virtual Memory for GPUs
PagedAttention borrowed its core idea from an old trick that every modern operating system uses. Let's understand that trick first, then see how it applies to KV cache.
How OS Virtual Memory Works
It's the same trick that makes your laptop work.
When you open Chrome, VS Code, Slack, and Spotify simultaneously, each program needs memory. The obvious approach: give each program a big contiguous chunk of RAM. But what if Chrome needs 2 GB, VS Code needs 1 GB, and you only have 8 GB total? If you pre-allocate generously, you run out fast.
The OS solves this with virtual memory. Think of RAM as a pegboard with numbered slots. Each program thinks it has been handed slots 1–16 in a row; really the OS gives it whichever slots happen to be free — 3, 17, 42, 9, … — and keeps a map from "what the program thinks it has" to "where it actually lives." That map is the page table. Once you see it this way, the formal terms below are just names for this pattern:
- Pages — RAM is divided into small fixed-size chunks, typically 4 KB each. Not one big block per program — thousands of tiny pages.
- Virtual addresses — each program thinks it has its own contiguous memory, starting at address 0. From Chrome's perspective, its memory is one continuous block.
- Page table — the OS maintains a lookup table that translates each program's virtual addresses to actual physical RAM locations. When Chrome reads "address 0x1000," the OS translates "virtual page 4 → physical page 17" and returns the right data. Chrome never sees the translation.
The pages are scattered across physical RAM — Chrome's page 4 might be in physical slot 17, its page 5 in physical slot 3. The program doesn't know or care.
This is why your laptop can run 50 apps simultaneously even though no single app owns all the RAM — each gets exactly the pages it needs, allocated from a shared pool, on demand.
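To make the translation concrete, here is a toy sketch in Python. The dictionary and function are illustrative stand-ins; a real OS keeps page tables in hardware-walked structures and the CPU's MMU does the lookup, but the arithmetic is the same.

```python
PAGE_SIZE = 4096  # 4 KB pages, the typical size

# Hypothetical page table for one program: virtual page number -> physical page number
page_table = {0: 17, 1: 3, 2: 42, 3: 9}

def translate(virtual_address: int) -> int:
    """Translate a program's virtual address into a physical RAM address."""
    virtual_page = virtual_address // PAGE_SIZE   # which page the address falls in
    offset = virtual_address % PAGE_SIZE          # position inside that page
    physical_page = page_table[virtual_page]      # the lookup the page table provides
    return physical_page * PAGE_SIZE + offset

# The program asks for address 4096 (the start of its virtual page 1);
# that byte actually lives at the start of physical page 3.
print(translate(4096))  # -> 12288
```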
The Same Idea for KV Cache
PagedAttention applies this exact pattern to GPU memory:
- Each sequence thinks its KV cache is contiguous (the virtual view)
- Physically, the KV cache is stored in small blocks scattered across GPU memory
- A block table maps virtual block indices to physical block locations
The attention computation runs on the virtual view — the sequence behaves as if its cache is contiguous. The block table translation happens transparently underneath.
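As a rough illustration (toy shapes and invented names, not vLLM's actual CUDA kernel), here is how a sequence's logically contiguous keys can be gathered from scattered physical blocks by following its block table:

```python
import numpy as np

BLOCK_SIZE = 16   # tokens per block
HEAD_DIM = 8      # toy head dimension

# Physical KV storage: a shared pool of blocks standing in for GPU memory
num_physical_blocks = 64
k_pool = np.random.randn(num_physical_blocks, BLOCK_SIZE, HEAD_DIM)

# One sequence's block table: virtual block i -> physical block index
block_table = [17, 3, 42]   # 3 blocks, enough for up to 48 tokens
seq_len = 40                # tokens actually generated so far

def gather_keys(block_table, seq_len):
    """Return this sequence's keys as one contiguous array (the virtual view)."""
    blocks = [k_pool[b] for b in block_table]   # follow the block table
    keys = np.concatenate(blocks, axis=0)       # stitch the blocks together
    return keys[:seq_len]                       # drop unused slots in the last block

print(gather_keys(block_table, seq_len).shape)  # (40, 8)
```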
What This Looks Like
Look at the Paged side in the right panel. Compare it to the Contiguous side:
- Physical blocks are allocated tightly with no gaps — no ! waste blocks
- The Virtual → Physical mapping table below the grid shows each sequence's virtual blocks pointing to their actual physical locations
- Blocks appear only when tokens are actually generated — no pre-allocation
Focus on the block table: each sequence's virtual blocks map to scattered physical locations, but the sequence doesn't know or care. That's the same illusion the OS provides to programs.
The same trick that lets your laptop run 50 apps at once now lets a GPU serve 50 LLM requests simultaneously. Virtual memory for KV cache — an idea from the 1960s solving a 2023 problem.
In the next step, we'll look at how the block manager actually allocates and frees these blocks — and why the block size matters.
Block Tables & Allocation
Now let's look at how PagedAttention actually manages those blocks — the mechanics that turn the virtual memory idea into a working system.
Fixed-Size Blocks
Each physical block holds a fixed number of tokens' worth of KV cache — typically 16 tokens. So a 1024-token sequence uses 64 blocks. A 100-token sequence uses 7 blocks (6 full + 1 partially filled).
The block size is fixed at system startup. Every block is the same size, which makes the free list simple and allocation fast.
The Block Manager
The serving system maintains a free list — a pool of available physical blocks. When a sequence needs another block (its latest token won't fit in its current last block), the block manager:
- Pops one block from the free list
- Adds an entry to that sequence's block table: virtual block N → physical block M
- The attention kernel reads through the block table transparently
When a sequence finishes generating, all its blocks are returned to the free list. They're immediately available for new requests — no fragmentation, no wasted reservation.
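Here is a minimal sketch of that bookkeeping in Python. The class and method names are invented for illustration and heavily simplified compared to a real serving engine:

```python
class BlockManager:
    """Toy block manager: a free list plus one block table per sequence."""

    def __init__(self, num_physical_blocks: int):
        self.free_list = list(range(num_physical_blocks))  # every block starts free
        self.block_tables = {}                             # seq_id -> [physical block ids]

    def allocate_block(self, seq_id: str) -> int:
        """Pop a free physical block and append it to the sequence's block table."""
        if not self.free_list:
            raise MemoryError("no free blocks: caller must preempt or swap")
        physical_block = self.free_list.pop()
        self.block_tables.setdefault(seq_id, []).append(physical_block)
        return physical_block

    def free_sequence(self, seq_id: str) -> None:
        """Sequence finished: return all of its blocks to the pool immediately."""
        self.free_list.extend(self.block_tables.pop(seq_id, []))

# A sequence grabs blocks as it generates, then releases them when it completes.
mgr = BlockManager(num_physical_blocks=8)
mgr.allocate_block("req-1")
mgr.allocate_block("req-1")
print(len(mgr.free_list))   # 6
mgr.free_sequence("req-1")
print(len(mgr.free_list))   # 8
```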
Waste Only in the Last Block
Compare this to the contiguous approach:
- Contiguous: pre-allocate max length → 60–80% waste
- Paged: allocate on demand → waste only in the last block per sequence
The last block of each sequence may be partially filled — a 100-token sequence fills 6 complete blocks and then uses 4 of 16 slots in block 7. That's 12 wasted slots, which is less than one block per sequence.
In practice, with many concurrent sequences, total waste is under 4%.
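A quick back-of-the-envelope comparison, using made-up request lengths, shows where those numbers come from:

```python
import math

BLOCK_SIZE = 16
MAX_SEQ_LEN = 2048                      # worst case each contiguous reservation must cover
actual_lengths = [100, 730, 45, 1500]   # tokens actually produced by four requests

# Contiguous: every request reserves MAX_SEQ_LEN up front
contiguous_reserved = MAX_SEQ_LEN * len(actual_lengths)
contiguous_waste = contiguous_reserved - sum(actual_lengths)

# Paged: each request holds ceil(length / BLOCK_SIZE) blocks; the only waste
# is the unused slots in each sequence's final block
paged_reserved = sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in actual_lengths)
paged_waste = paged_reserved - sum(actual_lengths)

print(f"contiguous waste: {contiguous_waste / contiguous_reserved:.0%}")  # ~71%
print(f"paged waste:      {paged_waste / paged_reserved:.0%}")            # ~1%
```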
Block Size Tradeoff
Why 16 tokens? It's a tradeoff:
- Smaller blocks (e.g., 4 tokens): less waste per sequence, but more block table entries and more overhead during attention
- Larger blocks (e.g., 64 tokens): less overhead, but more waste in the last block and less flexibility
16 tokens is the empirical sweet spot for most workloads — small enough that waste stays minimal, large enough that the block table stays manageable.
From 60–80% waste down to under 4%. That freed memory directly translates to 2–4× more concurrent requests on the same GPU — same hardware, dramatically higher throughput.
Step through the simulation in the right panel. Watch the free list counter drop as sequences allocate blocks, then jump back up when a sequence ends and returns its blocks to the pool. Notice that paged waste stays near ~3–4% — only the last block per sequence has unused slots.
Copy-on-Write & Prefix Sharing
The block table abstraction unlocks two more powerful capabilities: sharing blocks across sequences, and reusing blocks across requests.
Beam Search: Sharing a Prefix
Beam search is a decoding strategy where the model generates N candidate continuations in parallel, scores them, keeps the best ones, and extends further. The name "beam" comes from the idea of a flashlight beam sweeping through a search tree — you only illuminate the N most promising paths at each step. (The technique originated in speech recognition in the 1970s.) It's used when you need high-quality outputs (translation, summarization).
All candidates start with the same prompt — the shared prefix. As they diverge, each beam generates different tokens.
The memory problem: without PagedAttention, each candidate gets its own full copy of the KV cache from the start. 8 beams = 8× the memory for the same prefix data.
With PagedAttention, all candidates can share the same physical blocks for the shared prefix. Only when candidates diverge — generating different tokens — does a new block get allocated for each divergent branch.
This is copy-on-write, the same pattern your OS uses when a process forks a child:
- Both parent and child initially point to the same physical memory pages
- When either one writes to a page, the OS makes a private copy for that process
- Until the write happens, no duplication occurs
In beam search: when two candidates generate different tokens at step K, that's the write event. Only block K gets duplicated. All earlier blocks remain shared.
The right panel shows the impact: "Without CoW" duplicates all prefix blocks per beam. "With CoW" shares the prefix — only divergent blocks are unique. Compare the block counts.
The result: up to 55% memory savings for beam search, enabling 2.2× throughput on beam-heavy workloads.
Prefix Caching: Same System Prompt
Most production deployments use a system prompt — "You are a helpful assistant..." or a multi-paragraph context for a RAG application. Every user request gets this same prefix.
Without prefix caching: each request computes KV for the full system prompt from scratch. 1000 requests = 1000× the same computation and 1000× the memory.
With prefix caching: the KV blocks for the system prompt are computed once and stored. Every new request that shares that prefix just points its block table at those same physical blocks. No recomputation, no extra memory.
This is what "prompt caching" means when you see it in OpenAI or Anthropic API pricing — the system prompt KV is computed once and cached for subsequent requests with the same prefix.
"But Every User Sends a Different Prompt?"
You might wonder: if every user types something different, how does prefix sharing help? The key is that in production, every request looks like this:
[System prompt: 500–2000 tokens, identical] + [User message: varies]
The system prompt — "You are a helpful assistant...", tool-use instructions, or a multi-page RAG context — is the same across all requests to that deployment. 1000 concurrent users means 1000 different user messages, but the same 1000-token system prompt prefix repeated 1000 times. Without prefix caching, that's 1000× the KV memory for identical data. With it, it's stored once.
Copy-on-write is even simpler: beam search by definition starts all N candidates from the same sequence, so the shared prefix is always 100% of the input.
1000 requests with the same system prompt don't cost 1000× the memory. The blocks are shared — computed once, reused for every request. This is why prompt caching is such a significant cost reduction at scale.
Why Block Sharing Works
Block sharing is only possible because of the virtual→physical indirection. Multiple sequences can have different virtual block tables that point to the same physical block. The attention kernel reads through the block table without knowing or caring that two sequences point to the same physical location.
When a sequence needs to write to a shared block (because it's diverging), the block manager sees that the block's reference count is greater than 1, allocates a new physical block, copies the contents, and updates that sequence's block table. The old block's reference count drops by 1, and it stays shared by the remaining sequences.
Pure indirection enabling pure sharing — the same reason OS virtual memory enables efficient process forking.
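In code the mechanism is small. This is a simplified sketch with invented names, not the real block manager:

```python
class Block:
    """A physical KV block plus the reference count used for copy-on-write."""
    def __init__(self, data):
        self.data = list(data)
        self.ref_count = 1

def share(block):
    """Another sequence points at the same physical block: bump the count, copy nothing."""
    block.ref_count += 1
    return block

def write(block, slot, value):
    """Write into a block, copying it first if another sequence still shares it."""
    if block.ref_count > 1:              # the write event: divergence detected
        block.ref_count -= 1             # original stays with the other sequence(s)
        block = Block(block.data)        # private copy for the writer (ref_count = 1)
    block.data[slot] = value
    return block

prompt_block = Block([0] * 16)           # shared prompt KV block
beam_a = share(prompt_block)             # beam A and beam B share it physically
beam_a = write(beam_a, 0, 111)           # beam A diverges and gets its own copy
print(prompt_block.data[0], beam_a.data[0])   # 0 111  (beam B's view is untouched)
```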
PagedAttention in Practice
PagedAttention was published in 2023 by the team at UC Berkeley. Its first implementation — vLLM — became the reference LLM serving system almost immediately.
The Numbers
The impact was stark:
- 24× throughput vs Hugging Face Transformers (the standard research inference library)
- 3.5× throughput vs TGI (Hugging Face's production serving system at the time)
The improvement comes entirely from memory utilization. vLLM can fit 3–4× more concurrent requests on the same GPU because it wastes almost no memory on fragmentation. More concurrent requests = higher GPU utilization = higher throughput.
Block Swapping
What happens when GPU memory fills up entirely — even with paged allocation?
vLLM supports block swapping: when all physical blocks are occupied and a new request needs memory, the system can swap a preempted sequence's KV blocks from GPU memory to CPU RAM. When GPU blocks free up again, the blocks are swapped back and that sequence resumes.
This lets vLLM serve requests longer than GPU memory alone would allow. The swap adds latency — CPU↔GPU transfer is slow — so it's a last resort, not the normal path. But it means the system degrades gracefully under load rather than failing.
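A rough sketch of the two moves, with invented names and plain Python dicts standing in for GPU and CPU memory:

```python
def swap_out(seq_id, block_tables, gpu_blocks, cpu_blocks, free_list):
    """Preempt a sequence: copy its KV blocks to CPU RAM and free the GPU blocks."""
    for phys in block_tables[seq_id]:
        cpu_blocks[(seq_id, phys)] = gpu_blocks[phys]   # slow GPU-to-CPU transfer in reality
        free_list.append(phys)                          # GPU block is immediately reusable

def swap_in(seq_id, block_tables, gpu_blocks, cpu_blocks, free_list):
    """Resume a sequence: copy its blocks back into whatever GPU blocks are now free."""
    new_table = []
    for old_phys in block_tables[seq_id]:
        phys = free_list.pop()
        gpu_blocks[phys] = cpu_blocks.pop((seq_id, old_phys))
        new_table.append(phys)
    block_tables[seq_id] = new_table                    # table now points at the new locations

gpu, cpu, free = {0: "kv-0", 1: "kv-1"}, {}, []
tables = {"req-7": [0, 1]}
swap_out("req-7", tables, gpu, cpu, free)   # free == [0, 1]; blocks parked in CPU RAM
swap_in("req-7", tables, gpu, cpu, free)    # blocks copied back, table remapped
```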
Industry Adoption
PagedAttention's ideas are now the standard in LLM serving:
- SGLang (Stanford) uses paged KV management for structured generation
- TensorRT-LLM (NVIDIA) adopted paged KV cache for production deployments
- TGI (Hugging Face) rewrote their memory management to use paged allocation
- Most cloud LLM APIs run on vLLM or a system that copied its approach
LMSYS Chatbot Arena — one of the largest public LLM evaluation platforms — runs on vLLM and handles 30,000–60,000 requests per day with approximately 50% fewer GPUs than a comparable system without PagedAttention.
The Broader Pattern
PagedAttention solved the KV cache memory problem at exactly the right moment — as LLMs went from research curiosity to production workload. The throughput gains translated directly into lower serving costs, making many applications economically viable that weren't before.
PagedAttention (2023) solved the serving memory problem the same way OS virtual memory solved the RAM problem in the 1960s. Good ideas recur — and when they do, the impact compounds.
The right panel shows the key performance numbers from the vLLM paper, plus a block swap animation — click the buttons to see how GPU blocks move to CPU RAM and back when memory pressure hits.
In the next step, we connect PagedAttention to every other module in the curriculum. The full picture, end to end.
The Full Serving Stack
This is the final module. You've now learned every major component of the modern LLM stack. Let's connect them into one complete picture — the exact sequence of events that happens every time you send a message to an LLM.
The right panel shows the complete 9-module pipeline — every module you've learned, connected. Module 9 (Paged Attention) is highlighted as the final piece.
The Request Lifecycle
When you type a message and hit send, every module you've learned runs, in order.
Every Module, One Sentence
Module 1 — Tokenization: your text is split into tokens and converted to integer IDs, because neural networks work with numbers, not characters.
Module 2 — Embeddings: each token ID is looked up in a learned table to produce a dense vector; words with similar meanings land near each other in this space.
Module 3 — Attention: tokens look at each other through Q/K/V projections — query asks "what do I need?", key answers "what do I have?", value carries the information — across multiple heads in parallel, with a causal mask so tokens can't see the future.
Module 4 — Transformer Block: each block runs attention + a feed-forward network + residual connections, then normalizes — this block repeats N times (32–96 layers in large models), each refining the representation.
Module 5 — Generation: the final layer produces logits over the vocabulary; temperature scales them; top-K/top-P narrows the candidates; one token is sampled — then the whole forward pass runs again for the next token.
Module 6 — KV Cache: instead of recomputing keys and values for the prompt on every decode step, store them after prefill; decode only needs to compute K/V for the single new token.
Module 7 — Quantization: model weights are compressed from FP16 to INT4 (4× smaller), and KV cache can be stored in FP8 (2× smaller) — same behavior, fraction of the memory.
Module 8 — Batching: continuous batching schedules multiple requests into a single GPU pass — new requests join mid-batch as slots free, eliminating the GPU idle time of static batching.
Module 9 — Paged Attention: KV cache blocks are allocated on demand from a free pool via a virtual→physical block table, reducing memory waste from 60–80% to under 4% and enabling prefix sharing across requests.
The Full System, Together
These aren't independent modules — they interact at every level:
- Quantization reduces weight size, leaving more GPU memory for KV cache
- KV Cache requires that memory to be managed efficiently — PagedAttention provides that management
- Paged blocks enable continuous batching to work well — freed blocks go back to the pool immediately
- Prefix caching (PagedAttention) means shared system prompts don't multiply memory cost with batch size
- The attention kernel (FlashAttention) reads through paged blocks with minimal overhead
A request entering a production system passes through every one of these systems in sequence, every decode step.
You now understand the entire pipeline — from the text you type to the token that appears on screen. These concepts are not theoretical. They are running right now, in data centers worldwide, serving billions of requests every day.