Prefix Caching & RadixAttention Explained
The Shared-Prefix Waste
What is prefix caching?
Prefix caching is a serving-layer optimization that reuses the KV cache computed for a shared prompt prefix across many requests. When thousands of users send messages that all begin with the same 1,000-token system prompt, a naive server recomputes those 1,000 tokens 1,000 times. A prefix-caching server computes them once, stores the resulting KV tensors, and reuses them for every subsequent request that shares the prefix. In production, this is why Anthropic, OpenAI, and Gemini all advertise cache-read pricing at 10–50% of the base input rate — it reflects the GPU work the server genuinely doesn't have to do.
The waste, in numbers
Imagine your product has a 1,000-token system prompt ("You are a helpful assistant working with engineers. Here are your rules... Here's how to format responses... Here are your tools..."). Ten thousand users send a message each hour. Without prefix caching:
- Work done per request: prefill 1,020 tokens (1,000 system + ~20 user message)
- Work done per hour: 10,000 × 1,020 = 10.2M prefill tokens
With prefix caching:
- Work done per hour: 1,000 (first request) + 10,000 × 20 (user messages) = 201,000 prefill tokens
- Savings: ~98%
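If you want to sanity-check the arithmetic, here it is as a few lines of Python (the volumes are the illustrative ones above, not measurements):

```python
SYSTEM_TOKENS = 1_000   # shared system prompt
USER_TOKENS = 20        # per-request user message
REQUESTS = 10_000       # requests per hour

no_cache = REQUESTS * (SYSTEM_TOKENS + USER_TOKENS)
with_cache = (SYSTEM_TOKENS + USER_TOKENS) + (REQUESTS - 1) * USER_TOKENS
savings = 1 - with_cache / no_cache
print(f"{no_cache:,} vs {with_cache:,} prefill tokens -> {savings:.1%} saved")
# 10,200,000 vs 201,000 prefill tokens -> 98.0% saved
```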
The GPU time saved lines up with the bill you get from your provider. Anthropic's cache-read multiplier is 0.1×, OpenAI's is 0.5×, Gemini's (explicit) is 0.1× — all reflect genuine compute you no longer have to pay for.
"Cached" means the server keeps the KV tensors (the output of running tokens through every attention layer) in GPU memory, not the prompt text. When a new request arrives with the same prefix, the server skips the prefill stage for those tokens and jumps straight into decode with the cached KVs already populated.
Recap: what prefill does (and why it's expensive)
Recall from the Prefill/Decode Disaggregation module: every request has two phases. Prefill processes the entire prompt in one shot — the server runs every prompt token through every transformer layer to produce the KV cache. Decode then generates output tokens one at a time. Prefill is compute-bound and scales linearly with prompt length, so a 1,000-token system prompt costs roughly 50× as much prefill work as a 20-token user message. That's where the waste lives.
Try it on the right
Click the + Add request buttons on the right panel to simulate incoming requests, all sharing the same 1,000-token system prompt but with different user messages. Toggle cache OFF ↔ cache ON to see the prefill work collapse.
Watch the counters:
- No cache grows linearly with each added request.
- With cache stays almost flat — only the ~20-token user message is new.
- Savings climbs toward 98% as you add more requests.
The bigger the shared prefix and the more requests share it, the bigger the win. A 4,000-token agent scratchpad shared across 100,000 requests isn't a rounding error on your GPU bill — it's the whole bill.
Next: we'll see why the obvious implementation — "hash the whole prompt, look it up in a table" — doesn't work.
Why Full-Prompt Hashing Fails
The obvious first try
If you've built a caching layer before, your first instinct is reasonable: hash the prompt, use the hash as the cache key, look it up in a hash table.
```python
def lookup(prompt_tokens, cache):
    key = hash(tuple(prompt_tokens))   # one key for the entire prompt
    if key in cache:
        return cache[key]              # reuse the KV tensors
    else:
        kv = prefill(prompt_tokens)    # full prefill, nothing reused
        cache[key] = kv
        return kv
```
This works perfectly — if the entire prompt is identical. If user A and user B both send exactly the same string, they'll hit the same cache entry.
But that's almost never the production reality. In production, prompts share a prefix and diverge near the end:
- Prompt A: "You are a helpful assistant. I need help with debugging a Python script that parses CSV files..."
- Prompt B: "You are a helpful assistant. I need help with debugging my React app that renders a list..."
The first 9 tokens are identical. Only tokens 10+ differ.
Why exact hashing throws away the shared work
A hash function is a one-way randomizer: any change to the input produces a completely different output. It has no notion of "similar" or "partial match." If two prompts share 999 of their 1,000 tokens but differ at a single position, their full-prompt hashes are as different as if they shared nothing at all.
That means the cache always misses. The 999 tokens of shared work get thrown away every single time.
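You can see the problem with any real hash function — a minimal sketch with SHA-256 over made-up token IDs:

```python
import hashlib

def prompt_hash(tokens):
    # Full-prompt hash: the whole token sequence is one cache key.
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

prompt_a = list(range(1_000))            # 1,000 tokens
prompt_b = prompt_a[:999] + [424_242]    # identical except the final token

print(prompt_hash(prompt_a) == prompt_hash(prompt_b))  # False -> cache miss
# 999 of 1,000 tokens are shared, yet the two digests have nothing in common.
```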
Try it on the right
Slide the divergence slider on the right panel. You'll see two prompts that share the first N tokens and diverge from position N onward. Below each prompt, the full-prompt hash is computed from the pill labels.
Notice:
- The hashes never match, no matter where you put the divergence.
- The red CACHE MISS stamp stays on forever.
- Even at slider position 9 — where only 1 of 10 tokens differs — the hashes are entirely unrelated.
This is the "0 or 1" problem: exact-hash keys can only express "identical" or "different." They can't express "first 9 tokens identical."
The problem is structural, not a choice of hash function. SHA-256, MurmurHash, FNV — every collision-resistant hash has this property by design. You can't fix this by picking a different hash; you have to pick a different data structure.
What we actually need
We need a cache key that can answer a different question: "given this new prompt, what is the longest prefix that already exists in the cache?" That requires walking the prompt token by token, matching against something that can branch.
Two production engines answer this two different ways:
- SGLang's RadixAttention builds a radix tree that represents all cached prefixes as paths from the root. A new request walks the tree from the root, matching tokens as it goes. Where it diverges, the tree splits.
- vLLM's Automatic Prefix Caching keeps the hash-table approach — but at block granularity, chaining the hashes together so two prompts whose first 62 blocks match get 62 cache hits automatically.
Both avoid the "0 or 1" trap. The next two steps walk through each approach.
The Prefix Tree (RadixAttention)
The shape of the answer
We need a data structure where cached prefixes are paths and new requests walk the paths as far as they match. Where a new request diverges from every existing path, the structure creates a new branch.
This is a tree. Each edge holds a run of tokens; each internal node marks a branching point where two or more cached requests once diverged. Each leaf corresponds to a complete cached request.
Trie vs radix tree
A trie is the simplest version: one token per edge. Walking a trie for a 1,000-token prompt touches 1,000 nodes and 1,000 edges. That's a lot of bookkeeping.
A radix tree compresses any chain of single-child nodes into a single edge labeled with the full token sequence. Two cached requests that share a 950-token prefix but differ at token 951 use one edge of length 950 to the split point, not 950 one-token edges. That compression of single-child chains into single edges is what distinguishes a radix tree from a plain trie.
SGLang (LMSYS, January 2024) called their version RadixAttention and reported up to 5× throughput over vLLM at publication time on shared-prefix workloads like multi-turn chat, few-shot evaluation, and agent loops. Hit rates in production range from 50% to 99% depending on workload.
Insert walk: match, split, or append
When a new request arrives with its token sequence, the server walks the tree from the root:
- Find the child edge whose first token matches the request's next token.
- Compare the edge's tokens with the request's remaining tokens.
- Full match, consumes the edge: descend to that child, continue with remaining tokens.
- Partial match (diverge mid-edge): the edge must split. Create a new internal node at the divergence point. The old edge becomes a child holding its suffix; a new sibling edge is created for the new request's novel suffix.
- No matching child: attach a new leaf from the current node with the request's remaining tokens.
The split case is the one to watch. It's what lets the tree grow more structure in response to new requests without losing any cached work.
Each internal node costs a small bookkeeping record (token list, child pointers, reference counts, block pointers — a few dozen bytes). The tree trades that overhead for the ability to answer "longest matching prefix" in one walk from the root. For a typical production cache with thousands of entries, the bookkeeping is a rounding error compared to the KV tensors themselves.
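Here is a minimal sketch of that insert walk in Python. It is deliberately simplified — SGLang's real tree also tracks KV block pointers, locks, and last-access times per node — but the match/split/append logic is the part that matters:

```python
class Node:
    """One radix-tree node; `tokens` is the label of the edge leading into it."""
    def __init__(self, tokens=()):
        self.tokens = list(tokens)   # token run on the incoming edge
        self.children = {}           # first token of a child edge -> child Node
        self.refcount = 0            # in-flight requests using this node (see Step 5)

def insert(root, tokens):
    """Walk from the root, reusing cached tokens; return how many were already cached."""
    node, i, matched = root, 0, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            # No matching child: append the remaining tokens as a new leaf.
            node.children[tokens[i]] = Node(tokens[i:])
            return matched
        edge, j = child.tokens, 0
        while j < len(edge) and i + j < len(tokens) and edge[j] == tokens[i + j]:
            j += 1
        matched += j
        if j == len(edge):
            node, i = child, i + j           # full match: consume the edge, keep walking
        else:
            # Partial match: split the edge at the divergence point.
            split = Node(edge[:j])
            node.children[edge[0]] = split
            child.tokens = edge[j:]          # old suffix keeps its existing subtree
            split.children[child.tokens[0]] = child
            if i + j < len(tokens):          # new request's novel suffix -> sibling leaf
                split.children[tokens[i + j]] = Node(tokens[i + j:])
            return matched
    return matched

root = Node()
insert(root, "You are a helpful assistant . Debug my Python script".split())
print(insert(root, "You are a helpful assistant . Fix my React app".split()))
# 6 -> six shared-prefix tokens found by walking the tree, no exact-match lookup needed
```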
What the edges actually hold
Recall from the PagedAttention module: the KV cache isn't a single giant tensor — it lives in fixed-size physical blocks (typically 16 tokens of KV per block) scattered across GPU HBM. Each sequence has a block table that maps its logical positions to those physical blocks.
RadixAttention integrates with this. Each edge in the tree points to a run of physical KV blocks — the exact blocks that hold the KV tensors for the tokens on that edge. Two requests that share an edge share the same physical blocks; they don't copy KV data, they share pointers to it.
Try it on the right
Click the R1 through R6 buttons on the right panel to insert requests into a live radix tree:
- R1 on an empty tree → a single edge from root to a leaf holding all its tokens.
- R2 shares [You, are, helpful] with R1 → the edge splits at "helpful." You get a new internal node and two sibling leaves.
- R3 extends the sharing → three siblings under the same internal node.
- R4 shares only [You, are] with the earlier group → a deeper split, creating a second subtree for "You are a code reviewer…"
- R6 shares nothing with anyone → a new top-level branch off the root.
Click the same request twice — the second click is a HIT (green flash). All tokens served from cache, zero new prefill work.
Watch the counters:
- Cached tokens grows with each new request but grows less when edges split (shared tokens aren't duplicated).
- Cache hits counts re-clicks on previously-added requests.
The tree is the cache. Every path from root to leaf is a cached request, and every shared edge is storage you didn't have to pay for twice.
Next: vLLM took the same problem and solved it differently — without a tree, using block-level Merkle-style hashing.
Block-Hash Chain (vLLM APC)
A different answer to the same problem
SGLang reached for a tree. vLLM reached for a hash table — but a smarter one. vLLM's Automatic Prefix Caching (APC) keeps the simplicity of hashing while avoiding the "0 or 1" trap by hashing at block granularity instead of full-prompt granularity.
The key idea: the KV cache already lives in fixed-size physical blocks (16 tokens each, from PagedAttention). So hash each block individually, and chain the hashes together. Two prompts that share the first 62 blocks of 16 tokens produce the same first 62 block hashes — guaranteed — and those blocks hit the cache one by one.
Merkle-style chaining
Each block's hash depends on the tokens in the block and on the previous block's hash:
```
block_hash_0 = H(seed, tokens_in_block_0, extras)
block_hash_1 = H(block_hash_0, tokens_in_block_1, extras)
block_hash_2 = H(block_hash_1, tokens_in_block_2, extras)
...
```
This is Merkle-style chaining: each block's hash is a commitment to everything that came before it. A prompt's full list of block hashes is a fingerprint of its entire prefix, position by position.
The extras term includes a few things that have to be part of the cache key:
- The active LoRA adapter ID (same tokens but different adapter → different KV → different cache entry).
- The multimodal input hash (images in the prompt influence KV values).
- An optional cache_salt — vLLM added this (RFC #16016) to isolate tenants. Without a salt, a multi-tenant serving deployment leaks information via cache timing: tenant A can detect whether tenant B has recently sent a particular prefix by measuring response latency. Per-tenant salts make each tenant's block hashes disjoint.
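Here's a minimal sketch of the chain in Python. The hashing scheme and field packing are illustrative, not vLLM's actual key layout — what matters is that each block's hash folds in the parent hash, the block's tokens, and the extras:

```python
import hashlib

BLOCK_SIZE = 16  # tokens of KV per physical block (PagedAttention default)

def block_hashes(token_ids, extras=("lora=None", "salt=None")):
    """One hash per *full* block; each hash commits to everything before it."""
    hashes, parent = [], "root"
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = tuple(token_ids[start:start + BLOCK_SIZE])
        parent = hashlib.sha256(repr((parent, block, extras)).encode()).hexdigest()
        hashes.append(parent)
    return hashes

a = list(range(1_024))       # 64 full blocks of 16 tokens
b = a[:992] + [7] * 32       # identical through block 61, diverges at token 992
matches = sum(x == y for x, y in zip(block_hashes(a), block_hashes(b)))
print(matches)               # 62 -> blocks 0-61 hit; blocks 62-63 miss
```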
Divergence is free
Now compare two prompts that share tokens 0–991 (62 full blocks of 16) but diverge at token 992:
- blocks 0–61: same tokens → same hashes → 62 cache hits.
- block 62: different tokens → different hash → cache miss → allocate a new physical block and add it to the global hash table.
- blocks 63+: built on different parent hashes → different hashes → miss, allocate, continue.
No tree. No edge splitting. No node-creation bookkeeping. Just a hash chain computed once per block, and a single global hash table mapping block_hash → physical_block_id.
Lookup is O(1) per block. Divergence is free.
The trade-off: you only hit on block boundaries (multiples of 16 tokens). Two prompts that share 995 tokens — 62 full blocks plus 3 tokens of the 63rd block — get 62 hits, not 62-and-three-sixteenths. The 3 shared tokens in block 63 mix with the divergent tokens and produce a different hash, so block 63 misses. This is the price of simplicity: you lose sub-block granularity, which the radix tree preserves.
The global hash table
There's just one hash table per GPU (or per cache shard), keyed on block_hash and pointing to the physical block that holds those KV tensors. When a new request computes its block hashes, each one either hits an existing entry (reuse that physical block) or misses (allocate a new physical block and insert into the table).
No tree traversal. No node-per-edge bookkeeping. Just a dictionary.
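Continuing the block_hashes sketch from above, the whole lookup path is a dictionary check per block (the integer block IDs stand in for PagedAttention's physical block pool):

```python
kv_cache = {}          # global table: block_hash -> physical_block_id
next_free_block = 0    # stand-in for the PagedAttention block allocator

def lookup_or_allocate(hashes):
    global next_free_block
    block_table, hits = [], 0
    for h in hashes:
        if h in kv_cache:
            hits += 1                        # reuse the cached physical block
        else:
            kv_cache[h] = next_free_block    # miss: allocate and register a new block
            next_free_block += 1
        block_table.append(kv_cache[h])
    return block_table, hits

lookup_or_allocate(block_hashes(a))          # request A: cache empty, 64 misses
_, hits = lookup_or_allocate(block_hashes(b))
print(hits)                                  # 62 -> B reuses A's first 62 physical blocks
```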
Try it on the right
Hit ▶ play on the right panel to step through two requests being processed block by block:
- Request A (pink) goes first. Every block is a MISS (cache is empty) — 5 new entries appear in the global hash table.
- Request B (sky) shares the first 3 "system prompt" blocks with A. Watch those blocks flash green (HIT), reusing A's physical blocks. Blocks 4–5 differ (user message text) → MISS, new entries allocated.
The hash table on the right grows only on misses. B's first 3 blocks don't add new entries — they reuse.
You can also step through manually to read each block's hash and check how chaining works.
When to prefer block-hash over a tree
Block-hash chaining is the default in vLLM v1. Engines that chose it over a tree cite:
- Implementation simplicity — a dictionary is easier than a thread-safe radix tree with edge splitting.
- Shardability — a global hash table distributes across cache shards by hash prefix; the tree doesn't shard cleanly.
- PagedAttention integration — physical blocks already exist; block-hash is a natural key.
The SGLang team chose the tree for different reasons — cache-aware scheduling (Step 5 explains) and sub-block matching for workloads where that matters. Both are production-proven; you'll find TTFT reductions of ~78% and throughput gains of ~254% reported by llm-d for vLLM APC on shared-system-prompt workloads, and 5× end-to-end gains reported by SGLang at publication time.
--enable-prefix-caching
In vLLM v1 this flag is on by default. You don't need to do anything to turn it on; you just need to structure your prompts so the shared prefix comes first.
Next: eviction. When the cache fills up, how do we know what's safe to remove?
Eviction and Copy-on-Write
The problem: caches fill up
GPU HBM is finite. A typical H100 has 80 GB, and you want to spend most of it on active in-flight decoding, not on cached prefixes from requests that finished an hour ago. So the cache must evict — remove old entries to make room for new ones.
But eviction has a safety constraint: you can't evict KV blocks that an in-flight request is currently reading from. If you evict a block mid-decode, the next attention kernel reads garbage and the generation corrupts. Naive LRU is unsafe.
Both SGLang and vLLM solve this with "live-reference safety" — they just express it differently.
SGLang: reference counts per tree node
SGLang's radix tree adds a refcount to every node. When a request is actively using a path from the root to some leaf, it increments the refcount of every node on that path. When the request finishes, it decrements them.
The eviction rule: a node is evictable only if refcount == 0, and you can only evict from the leaves. You can never evict an internal node that still has children — its tokens are shared by every descendant.
LRU runs on the set of zero-refcount leaves: pick the least-recently-used one, evict it, recurse up the tree (if its parent becomes a childless node and also has refcount 0, it becomes evictable too on the next pass).
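A sketch of that rule, reusing the hypothetical Node class from the RadixAttention step and assuming each node also records a last_used timestamp:

```python
def evictable_leaves(node, out=None):
    """Collect leaves with refcount == 0 — the only nodes safe to evict."""
    if out is None:
        out = []
    if not node.children:
        if node.refcount == 0:
            out.append(node)
    else:
        for child in node.children.values():
            evictable_leaves(child, out)
    return out

def evict_one(root):
    candidates = evictable_leaves(root)
    if not candidates:
        return None                                      # everything pinned: eviction stalls
    victim = min(candidates, key=lambda n: n.last_used)  # LRU among the safe leaves
    # Free the victim's KV blocks and unlink it from its parent. If the parent is now
    # a childless node with refcount 0, it becomes a candidate on the next pass.
    return victim
```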
vLLM: pinning
vLLM's block-hash approach doesn't have a tree, but it has the same safety requirement. When an in-flight request reads a physical block, vLLM pins that block (the scheduler marks it unevictable for the duration of the request). When the request finishes, the block is unpinned and re-joins the LRU queue.
LRU over unpinned blocks is identical in spirit to SGLang's LRU-over-zero-refcount-leaves: same invariant, different name.
Both names describe the same idea engineers have seen elsewhere: reference counting in garbage collectors, pin/unpin in OS page caches, retain/release in Objective-C / Swift. The novelty isn't the mechanism — it's applying a 50-year-old safety primitive to KV blocks.
Copy-on-write at branching points
What happens when two requests share a cached prefix but one of them starts writing (generating new tokens) in a way that requires modifying blocks that were previously shared?
This happens in beam search (multiple hypotheses branching from a shared prefix, each writing different tokens) and in any workflow where two requests share a prefix but need private memory going forward.
Copy-on-write (CoW): the shared blocks stay shared. When the first request needs to write a divergent token into what was a shared block, the server allocates a new physical block, copies the existing contents over (up to the write position), applies the write, and updates that request's block table to point to the new block. The other request is unaffected; its block table still points at the original.
You've seen this pattern before — it's the same mechanism OS virtual memory uses for fork(). The cost is deferred until a write actually diverges.
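A minimal sketch of the CoW step, using Python lists as stand-in physical blocks and per-block reference counts (real engines do this over GPU block IDs, but the bookkeeping has the same shape):

```python
BLOCK_SIZE = 16

def write_token(block_table, blocks, ref_counts, logical_pos, kv_value):
    """Write one token's KV into a sequence's block, copying first if it's shared."""
    block_id = block_table[logical_pos // BLOCK_SIZE]
    if ref_counts[block_id] > 1:
        # Block is shared with another sequence: copy before writing.
        new_id = len(blocks)
        blocks.append(list(blocks[block_id]))            # copy the existing contents
        ref_counts[block_id] -= 1
        ref_counts.append(1)
        block_table[logical_pos // BLOCK_SIZE] = new_id  # only this sequence repoints
        block_id = new_id
    blocks[block_id][logical_pos % BLOCK_SIZE] = kv_value

blocks = [[None] * BLOCK_SIZE]      # one physical block shared by two sequences
ref_counts = [2]
table_a, table_b = [0], [0]         # both block tables point at block 0
write_token(table_a, blocks, ref_counts, logical_pos=5, kv_value="A's divergent KV")
print(table_a, table_b)             # [1] [0] -> A got a private copy, B is untouched
```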
Memory-pressure thrash
Under heavy decode load, the cache is competing with active request KV for HBM space. When the system is memory-stressed, idle cached prefixes get evicted to make room for the next decode batch. If a popular prefix (e.g., your main product's system prompt) gets evicted and then re-requested shortly after, the server has to re-prefill it — wasting the savings.
This is cache thrash: the cache is big enough for the working set in aggregate, but the replacement policy plus scheduling order causes the hot prefixes to keep dropping out. You see it in the numbers as an unstable cache hit rate that swings with traffic.
The mitigation: cache-aware scheduling. SGLang's scheduler, given a batch of pending requests, prefers to run the ones that share prefixes with currently-cached paths. That keeps the hot prefixes pinned (via their refcounts) and out of the eviction candidate pool. It's a scheduling answer to a caching problem.
This coupling is worth calling out explicitly: prefix caching and batch scheduling are not independent subsystems. An engine with a beautiful cache and a naive "FIFO request order" scheduler will thrash on workloads where request order doesn't correlate with prefix reuse. The real win is when both are designed together.
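The scheduling tweak itself is small. A sketch only — not SGLang's actual scheduler — where matched_prefix_len is a hypothetical helper that does the read-only version of the tree walk from the RadixAttention step:

```python
def schedule_batch(pending, tree, batch_size):
    # Cache-aware ordering: run the requests whose prefixes are already in the tree
    # first, so hot paths stay referenced (refcount > 0) and out of the eviction pool.
    ranked = sorted(pending,
                    key=lambda req: matched_prefix_len(tree, req.tokens),
                    reverse=True)
    return ranked[:batch_size]
```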
Try it on the right
The right panel has a pre-populated radix tree with 8 leaves across 4 branches.
- Start in pressure LOW — nothing evicts, no timer, leaves sit quietly.
- Switch to pressure HIGH — every ~1.2 seconds, the LRU policy picks the least-recently-used unpinned leaf and evicts it. Watch leaves drop out.
- Click a leaf to pin it (refcount → 1, 🔒 icon). Click again to unpin (refcount → 0).
- Pin enough leaves that every unpinned one evicts — the remaining all-pinned set is immortal. Pin all leaves → a yellow warning appears ("no eligible leaf"): eviction stalls because there's nothing safe to remove.
Try this pattern: under HIGH pressure, pin the 3 leaves you care about most. Watch the other 5 evict over time. The 3 pinned leaves survive indefinitely — the cache is protecting your hot prefixes exactly the way cache-aware scheduling would.
Next: the decision framework — when does prefix caching actually help, and when does it silently do nothing?
When It Helps, When It Fails
Four workload archetypes
Prefix caching is a workload-shape bet. The question isn't "should I enable it?" (everything is on by default in the major engines). The question is "will my workload actually hit the cache?"
| Workload | Hit rate | Why |
|---|---|---|
| RAG / retrieval chatbot | ~50% | Shared system prompt (hit) + unique retrieved context per query (miss). The system prompt portion hits; the retrieval chunks don't. |
| Agent loop | 85–99% | Same tool definitions every turn, same scratchpad format, often same early-turn outputs. Tool-calling agents are the ideal prefix-caching workload. |
| Multi-turn chat | 70%, growing | Each turn's prefix is the entire conversation up to that point. Early turns cheap, long sessions eventually win big. Thrash risk on conversations near cache capacity. |
| Unique prompts | 0% | Per-user one-shot generation with no shared preamble. Caching is pure bookkeeping overhead. |
The timestamp-at-the-front anti-pattern
The single most common way to accidentally destroy your cache hit rate:
```python
# WRONG — kills every cache hit
prompt = f"Current time: {datetime.now()}\n\nYou are a helpful assistant...\n{user_msg}"
```
Every request gets a different timestamp → every prompt's first tokens differ → every downstream block hash (or tree path) differs → zero cache hits, regardless of workload. The 1,000-token system prompt that follows is identical across requests, but prefix matching is strictly left-to-right: once the first token diverges, nothing after it can hit.
The fix:
```python
# RIGHT — static content first, variable content last
prompt = f"You are a helpful assistant...\n{user_msg}\n\nCurrent time: {datetime.now()}"
```
Put everything that changes per request at the end. The cache walks prefixes left-to-right — anything before the first variable byte is fair game for hitting.
Order matters. Static content first, variable content last. This is the one thing every prompt engineer needs to internalize about prefix caching. An hour spent reordering your prompt template is often the highest-leverage optimization in the whole serving stack.
Silent misses on below-minimum prompts
Each provider has a minimum cacheable prefix size. Below the minimum, caching silently does nothing:
| Provider | Minimum |
|---|---|
| Anthropic (Sonnet 3.5/4/4.5, Opus 4) | 1,024 tokens |
| Anthropic (Sonnet 4.6) | 2,048 tokens |
| Anthropic (Opus 4.5/4.6/4.7, Haiku 4.5) | 4,096 tokens |
| OpenAI (GPT-4o, o1) | 1,024 tokens |
| Gemini 1.5 Flash | 1,024 tokens |
| Gemini 1.5 Pro | 4,096 tokens |
Silent miss. Anthropic returns cache_creation_input_tokens: 0 when your prefix is below the threshold — no error, no warning. Your code thinks caching is on; your bill says otherwise. Log this field explicitly if caching matters to your cost model.
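One way to catch it, sketched against the usage field names Anthropic documents (a hypothetical monitoring hook — adapt to whatever client or SDK you actually use):

```python
import logging

def check_cache_usage(response: dict) -> None:
    # `response` is the parsed JSON body of a Messages API call.
    usage = response.get("usage", {})
    cache_write = usage.get("cache_creation_input_tokens", 0)
    cache_read = usage.get("cache_read_input_tokens", 0)
    if cache_write == 0 and cache_read == 0:
        logging.warning(
            "Prompt cache inactive: prefix below minimum or cache_control not set "
            "(uncached input tokens: %s)", usage.get("input_tokens"),
        )
```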
Multi-tenant security: the cache-timing attack
Caches are observable through timing. In a multi-tenant deployment, if tenant A and tenant B share a global cache, A can probe B's recent prompts: send prompts A suspects B might have sent, measure response latency. Fast response → cache hit → B recently sent something matching that prefix. Slow response → miss.
vLLM added cache_salt (RFC #16016) to mix per-tenant state into every block hash. With distinct salts, tenant A's block hashes can never collide with tenant B's, so the cache effectively partitions by tenant.
Anthropic isolates at the org level today; they announced a move to workspace-level isolation in February 2026. OpenAI and Gemini's docs describe per-account / per-project isolation without detailing the mechanism.
If you're building a multi-tenant product on top of a shared serving engine, verify your cache-isolation strategy before launch. This is one of the places "default configuration" is not safe.
Pricing math: what 90% actually means
Take a concrete scenario: 10,000 requests × 1,000-token system prompt × Anthropic Sonnet. Anthropic input is $3.00 per million tokens base. Cache write = 1.25×, cache read = 0.1×.
Without caching:
```
10,000 × 1,000 × ($3.00 / 1,000,000) = $30.00
```
With caching (5-min TTL, keeps prefix warm):
```
1 write:     1 × 1,000 × $3.00 × 1.25 / 1M = $0.00375
9,999 reads: 9,999 × 1,000 × $3.00 × 0.10 / 1M ≈ $3.00
Total:       ≈ $3.00
```
That's a ~90% reduction, matching the headline discount. The write cost (~$0.004) is negligible at scale; the amortization happens on request 2 onward.
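The same math as a few lines of Python, so you can substitute your own volumes and your provider's current multipliers:

```python
BASE = 3.00 / 1_000_000            # $/input token (the Sonnet-class base rate used above)
WRITE_MULT, READ_MULT = 1.25, 0.10
requests, prefix_tokens = 10_000, 1_000

no_cache = requests * prefix_tokens * BASE
with_cache = (1 * prefix_tokens * BASE * WRITE_MULT
              + (requests - 1) * prefix_tokens * BASE * READ_MULT)
print(f"${no_cache:.2f} -> ${with_cache:.2f} ({1 - with_cache / no_cache:.0%} saved)")
# $30.00 -> $3.00 (90% saved)
```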
On the right panel, toggle the workload, provider, size, and anti-pattern controls. The cost bars update with the actual per-provider multipliers. Try:
- Agent loop + Anthropic + 4,096 tokens → ~85% savings.
- Flip anti-pattern ON → hit rate and savings both collapse. Proof the pricing math is workload-dependent.
- Unique prompts → 0% savings regardless of provider.
- 256 tokens + Anthropic → warning banner (below 1,024 minimum) + 0% savings, silent miss.
Hydragen: a different axis entirely
One further reading worth flagging explicitly so you don't conflate it with prefix caching: Hydragen (arXiv 2402.05099) fuses the attention math itself across sequences that share a prefix. It doesn't cache KV — it changes how the attention kernel processes shared-prefix batches, turning memory-bound matrix-vector ops into compute-bound matrix-matrix ops. Up to 32× throughput on CodeLlama-13B for extreme shared-prefix workloads.
Hydragen is complementary, not an alternative: you can do prefix caching (reuse KV) AND Hydragen-style attention (fuse queries over the shared KV). Different layer of the stack; same underlying shared-prefix insight.
Further reading
- SGLang RadixAttention blog (LMSYS, Jan 2024) — the original write-up with the 5× headline
- SGLang paper (arXiv 2312.07104) — full technical details, radix-tree algorithms, cache-aware scheduling
- vLLM Automatic Prefix Caching design doc — block-hash chain implementation details
- vLLM APC v1 design doc — the v1 redesign
- llm-d "KV-Cache Wins You Can See" — production benchmark with the 78% TTFT / 254% throughput numbers
- Anthropic prompt caching docs — ephemeral vs extended TTL, cache_control markers, pricing multipliers
- Gemini context caching docs — explicit CachedContent API + storage pricing
- OpenAI prompt caching announcement — auto-enabled, 50% discount, 1,024-token minimum
- OpenAI prompt caching guide — best practices for structuring prompts
- Hydragen (arXiv 2402.05099) — complementary optimization that fuses attention math across shared prefixes
- vLLM cache-salting RFC #16016 — multi-tenant isolation
- SqueezeBits vLLM vs TRT-LLM APC comparison — head-to-head benchmarks