Speculative Decoding — Draft, Verify, Accelerate
What is Speculative Decoding?
Speculative decoding is a technique that makes LLM inference 2-3x faster by using a small "draft" model to guess multiple tokens ahead, then verifying all guesses at once with the large target model. The output is mathematically identical to running the target model alone — no quality trade-off.
Think of it like a restaurant kitchen. The safe approach: wait for each customer to order, then start cooking. The speculative approach: a junior cook starts preparing likely dishes ahead of time. The head chef checks each dish — if it's right, serve it immediately (time saved). If it's wrong, toss it and cook the correct one (small waste, no harm done). Most of the time the guess is right, so the kitchen runs much faster.
In computer science, this is called speculative execution — CPUs have used the same trick for decades. Speculative decoding applies it to LLM token generation: a small "junior" model drafts tokens ahead, and the large "head chef" model verifies them.
The Decode Bottleneck
Recall from the Inference Engine module: LLM generation has two phases. Prefill processes all prompt tokens in parallel — it's compute-bound, and the GPU's math units are saturated. Decode generates tokens one at a time — it's memory-bandwidth-bound, reading all model weights from HBM for each single token.
Here's the problem: a 7B-parameter model at FP16 precision is ~14 GB of weights, and decode reads all of them from memory every single step. The GPU reads 14 GB and produces... one token. Its compute units are mostly idle — waiting for memory.
The Unintuitive Observation
Here's what makes speculative decoding possible: processing N tokens costs nearly the same as processing 1 token.
Imagine a librarian who must walk to a distant warehouse to fetch a heavy reference book (14 GB of model weights). Once she carries it back to the desk, looking up one answer takes a second. But looking up 10 answers from the same book takes barely any extra time — the expensive part was the walk, not the lookup. That's exactly what happens during decode: the GPU spends most of its time loading weights from memory (the walk), not computing (the lookup). Whether it applies those weights to 1 token or 10 tokens, the loading cost is the same ~14 GB. The extra compute for 10 tokens barely registers.
As Andrej Karpathy put it: "Forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch."
This means: if we could somehow give the target model K tokens to verify at once instead of generating them one by one, each forward pass would produce K tokens instead of 1 — for nearly the same wall-clock time.
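To see why, here's a rough roofline-style estimate in Python. The hardware numbers (HBM bandwidth, FP16 throughput) are illustrative assumptions, not any specific GPU's spec:

```python
# Back-of-envelope: why a decode pass over K tokens costs about the same as over 1.
# Hardware numbers below are illustrative assumptions, not a specific GPU's spec.
WEIGHT_BYTES = 14e9      # 7B parameters x 2 bytes (FP16)
HBM_BANDWIDTH = 2e12     # ~2 TB/s
FP16_FLOPS = 300e12      # ~300 TFLOP/s

def forward_pass_time(num_tokens: int) -> float:
    """Roofline-style estimate: the pass takes as long as its slower resource."""
    load_time = WEIGHT_BYTES / HBM_BANDWIDTH             # weights are read once, no matter how many tokens
    compute_time = (2 * 7e9 * num_tokens) / FP16_FLOPS   # ~2 FLOPs per parameter per token
    return max(load_time, compute_time)

print(f"{forward_pass_time(1) * 1e3:.2f} ms")   # ~7.00 ms: dominated by the 14 GB weight read
print(f"{forward_pass_time(10) * 1e3:.2f} ms")  # ~7.00 ms: compute grew 10x but is still tiny
```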
But where do those K tokens come from? That's Step 2.
On the right panel: Watch how each token requires a full forward pass — the GPU reads 14 GB of weights each time but only generates one token. That's the bottleneck speculative decoding attacks.
The Draft-Verify Idea
The K tokens come from a draft model — a much smaller, faster model from the same family. For example, Llama 3.2 1B (the "draft") generates candidate tokens for Llama 3.1 70B (the "target").
How It Works
The loop is simple (a minimal code sketch follows the list):
- Draft: The small model generates K tokens one by one (fast — it's tiny)
- Verify: The target model checks all K draft tokens at once, in a single pass through the model
- Accept/Reject: Starting from the first token, accept each draft token that the target agrees with. Stop at the first disagreement.
- Correct: Replace the rejected token with what the target actually wanted
- Repeat from the corrected position
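Here is that loop as a Python sketch. The models and their methods (next_token, probs_for, accepts, correction) are hypothetical placeholders, not a real library API; the exact accept/correct rule comes in the Verification Algorithm step.

```python
# Minimal sketch of one draft-verify cycle. Model objects and their methods
# are hypothetical placeholders, not a real library API.

def speculative_decode_step(tokens, draft_model, target_model, K=5):
    # 1. Draft: the small model proposes K tokens, one at a time (cheap passes).
    drafted = []
    for _ in range(K):
        drafted.append(draft_model.next_token(tokens + drafted))

    # 2. Verify: ONE target forward pass scores every drafted position at once.
    target_probs = target_model.probs_for(tokens, drafted)

    # 3/4. Accept the longest agreeing prefix; replace the first rejection
    #      with the target's own choice, then discard the rest.
    output = list(tokens)
    for tok, probs in zip(drafted, target_probs):
        if target_model.accepts(tok, probs):            # accept/reject rule (Verification step)
            output.append(tok)
        else:
            output.append(target_model.correction(probs))
            break
    return output                                        # 5. the caller repeats from here
```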
A Concrete Example
Prompt: "The capital of France is"
The draft model (1B) generates 5 tokens quickly: "Par" "is" "." "It" "is"
The target model (70B) verifies all 5 in one pass. How? Both models output a probability (confidence) for each token. The target compares its own confidence to the draft's — simplified here; the exact probabilistic rule comes in the Verification Algorithm step:
- If the target is equally or more confident → the draft got it right → accept
- If the target is less confident → the draft guessed wrong → reject, and use the target's own answer instead
Walking through the example:
- "Par" — draft: 72% confident, target: 85% confident (target agrees even more) → accepted ✓
- "is" — draft: 68%, target: 80% → accepted ✓
- "." — draft: 55%, target: 60% → accepted ✓
- "It" — draft: 45%, target: only 18% (target preferred "The") → rejected ✗ → corrected to "The"
- "is" — after a rejection, all remaining tokens are discarded (they were based on the wrong token)
Result: 4 tokens ("Paris. The") from a single pass of the expensive target model (plus 5 cheap draft passes). Standard decoding would have needed 4 target passes. Even after paying for the draft work, that's roughly a 2x speedup — and the output is identical.
Why Verification is Cheap
A forward pass is one complete run through the model — feeding tokens in, getting a prediction out. It's the basic unit of work: every token the model generates costs one forward pass.
Now, there are two kinds of forward passes:
- Processing multiple tokens at once (called "prefill"): the model reads through all tokens in parallel — like when it first reads your prompt. The GPU is busy doing math. Fast.
- Generating one new token (called "decode"): the model produces one token per pass. As you saw in Step 1, this is slow — the GPU spends most of its time loading weights rather than computing.
The key insight from Step 1: prefill (many tokens) costs almost the same wall-clock time as decode (one token), because the bottleneck is loading weights (same 14 GB either way), not computing.
Verification exploits this: when the target model checks K draft tokens, it processes them all together — the same way it reads a prompt. So verifying 5 tokens takes roughly the same time as generating 1 token. The verification is essentially "free."
The draft model is also cheap to run — it's ~70x smaller, so each of its forward passes is roughly 70x faster than the target's.
On the right panel: Click Play to watch the draft model generate tokens quickly (amber lane), then the target model verify them all at once (indigo lane). Green = accepted, red = rejected and corrected. Try switching examples and adjusting K.
The Verification Algorithm
How does the target model decide which draft tokens to accept? And how does this guarantee identical output? The answer is rejection sampling.
The Accept/Reject Rule
For each draft token x, we compare two probabilities:
- p(x) — the draft model's probability for this token
- q(x) — the target model's probability for this token
The rule: accept token x with probability min(1, q(x) / p(x))
This means:
- If q(x) ≥ p(x) — the target is at least as confident as the draft → always accept
- If q(x) < p(x) — the target is less confident → accept with probability q(x)/p(x), which is a coin flip weighted by how much the target disagrees
When a Token is Rejected
At the first rejection, the draft sequence is truncated. The rejected token is replaced by resampling from the residual distribution:
residual(x) = normalize(max(0, q(x) − p(x)))
This residual captures probability mass where the target model is more confident than the draft — exactly the tokens the draft under-estimated. The resampled token fills the gap.
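Here is the accept rule and the residual resample together as a small NumPy sketch, using a toy four-token vocabulary with made-up probabilities that mirror the "It" vs. "The" moment from the earlier example:

```python
import numpy as np

# Accept/reject + residual resampling for ONE drafted position.
# p_dist / q_dist are the draft's and target's full probability vectors over the
# vocabulary at this position; `drafted` is the token the draft picked.
# Toy vocabulary and made-up numbers, for illustration only.

def verify_token(drafted, p_dist, q_dist, rng=np.random.default_rng()):
    p, q = p_dist[drafted], q_dist[drafted]
    if rng.random() < min(1.0, q / p):               # accept with probability min(1, q/p)
        return drafted, True
    residual = np.maximum(0.0, q_dist - p_dist)      # mass where the target is more confident
    residual /= residual.sum()                       # normalize(max(0, q - p))
    return rng.choice(len(q_dist), p=residual), False

vocab = ["The", "It", "Par", "is"]
p_dist = np.array([0.30, 0.45, 0.15, 0.10])          # draft prefers "It" (45%)
q_dist = np.array([0.60, 0.18, 0.12, 0.10])          # target prefers "The" (60%)
token, accepted = verify_token(1, p_dist, q_dist)    # drafted token 1 = "It"
print(vocab[token], accepted)                        # usually: "The" False (rejected, corrected)
```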
The Mathematical Guarantee
This isn't an approximation. Leviathan et al. (ICML 2023) proved that speculative sampling produces output that is mathematically identical to sampling from the target model alone (Theorem 1).
Every output token follows exactly the target model's distribution q. The draft model only affects speed (how many tokens are accepted per loop), never quality.
Acceptance Rate
In practice, acceptance rates range from 53% to 75% per token, depending on:
- How well-matched the draft and target models are (same family = better)
- Task predictability (factual Q&A = high acceptance, creative writing = low)
- Draft length K (longer drafts have lower per-token acceptance at later positions)
On the right panel: Step through each draft token's verification with Next Token →. Then scroll to the Sandbox below: pick a token, drag the q(x) slider, and watch the accept rule min(1, q/p) snap from "always accept" to a weighted coin flip the moment q dips below p. The draft-excess strip (p − q) shows the draft over-confidence that drives rejection. The actual residual distribution that gets resampled lives at the other tokens at this position; it's visualized in the residual chart that appears when you step to a true rejection above.
Draft Model Variants
So far we've used a separate small model as the draft. But there are other ways to generate draft tokens — each with different tradeoffs in memory, speed, and applicability.
Independent Drafting — Separate Small Model
The classic approach from the original papers (Leviathan et al. and Chen et al., both 2023). Use a smaller model from the same family:
- Example: Llama 3.2 1B drafting for Llama 3.1 70B
- Speedup: 2-3x at low QPS
- Pros: Highest acceptance rate (trained on same data, same tokenizer)
- Cons: Must load two models in GPU memory (1B is small, but it adds up at scale)
This is the most straightforward and well-understood approach.
Self-Drafting — Heads on the Target Model
The independent approach requires loading a whole second model. What if we could make the target model draft its own tokens? That's the idea behind self-drafting: attach small, cheap "extra heads" to the target model that guess future tokens. Think of it like giving the model extra eyes that look ahead.
Medusa (named after the mythological Medusa, whose head sprouted many snakes) attaches K small prediction heads to the target model. Each head looks at the model's final hidden state and guesses a different future position — head 1 guesses the next token, head 2 guesses the token after that, and so on. The base model stays frozen; only the small heads are trained. Achieves 2.2-3.6x speedup.
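To make the "extra heads" idea concrete, here is a minimal PyTorch-style sketch in the spirit of Medusa. The sizes, the two-layer head design, and the class name are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Sketch of self-drafting heads in the spirit of Medusa. Each small head maps the
# frozen target model's last hidden state to logits for a different future
# position (head 1 -> next token, head 2 -> the one after, ...). Only the heads
# are trained. Sizes and the two-layer design are illustrative assumptions.

class DraftHeads(nn.Module):
    def __init__(self, hidden_size=4096, vocab_size=32000, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(),
                          nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden):                    # last_hidden: [batch, hidden_size]
        return [head(last_hidden) for head in self.heads]  # one [batch, vocab] guess per position

heads = DraftHeads()
guesses = heads(torch.randn(1, 4096))                  # 4 drafted positions from one hidden state
```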
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) takes a different approach: instead of looking at just the final output, it looks deeper — at the model's second-to-last layer, which contains richer information. This is like reading someone's draft notes (detailed) instead of their final summary (compressed). Because the features are richer, EAGLE makes better guesses — ~3x speedup, about 1.6x faster than Medusa.
Layer skipping is the simplest self-drafting approach: run tokens through only half (or a quarter) of the target model's layers as a quick draft. No extra parameters are needed at all — but the guesses are rougher, since fewer layers mean less reasoning.
Free Drafting — No Model at All
Prompt lookup decoding builds an n-gram index from the prompt itself. When the last 2-3 generated tokens match a sequence in the prompt, it predicts that the next tokens will continue that match (sketched in code after the tradeoffs below).
- Speedup: Up to 2.8x on summarization (CNN/DailyMail)
- Pros: Zero compute cost for drafting — no model, no GPU, no memory
- Cons: Only works when the output closely mirrors the input (summarization, RAG, Q&A where the model quotes from context)
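A minimal sketch of the lookup idea, using whole words as stand-in "tokens". The function name and details are illustrative; real implementations handle multiple matches and length caps more carefully:

```python
# Minimal sketch of prompt-lookup drafting, using whole words as stand-in "tokens".
# If the last few generated tokens also appear in the prompt, propose the tokens
# that followed them there. Illustrative only.

def prompt_lookup_draft(prompt_tokens, generated_tokens, ngram=3, max_draft=5):
    if len(generated_tokens) < ngram:
        return []                                           # not enough context to match yet
    key = tuple(generated_tokens[-ngram:])
    for i in range(len(prompt_tokens) - ngram):
        if tuple(prompt_tokens[i:i + ngram]) == key:
            start = i + ngram
            return prompt_tokens[start:start + max_draft]   # continue the matched span
    return []                                               # no match -> no free draft this step

prompt = "the quick brown fox jumps over the lazy dog".split()
generated = "it says the quick brown".split()
print(prompt_lookup_draft(prompt, generated))               # ['fox', 'jumps', 'over', 'the', 'lazy']
```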
On the right panel: Toggle between the three approaches. Notice how Independent uses a separate model (more memory), Self-Drafting adds heads to the target (less memory, requires fine-tuning), and Prompt Lookup needs no model at all (free but limited to copy-heavy tasks). The comparison table summarizes the tradeoffs.
When to Use Speculative Decoding
Speculative decoding isn't a universal win. It has a clear sweet spot — and clear failure modes.
Real-World Speedup Numbers
| Approach | Speedup | Setup |
|---|---|---|
| Classical (Llama 1B → 70B) | 2-3x | Low QPS |
| Medusa | 2.2-3.6x | Fine-tuned heads |
| EAGLE | ~3x | Fine-tuned draft head |
| Prompt Lookup | Up to 2.8x | Summarization only |
| TensorRT-LLM (Llama 405B, 4× H200) | 3.6x | FP8 quantization |
The Sweet Spot
Speculative decoding works best for latency-sensitive, low-concurrency serving:
- Chatbots and coding assistants (interactive, low QPS per user)
- Search and retrieval (fast response expected)
- Single-user inference (local deployment, personal assistants)
The expected number of tokens per draft–verify cycle follows a geometric series. With per-token acceptance rate α and draft length K:

Expected tokens per cycle = 1 + α + α² + … + α^K = (1 − α^(K+1)) / (1 − α)

Every cycle produces at least one token (the leading 1); each α^i term adds the probability that the first i draft tokens all survive verification.
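Plugging in numbers shows how quickly the payoff decays as α drops. The 5% draft-cost ratio below is an assumed figure, roughly in the spirit of a 1B draft serving a 70B target:

```python
# Expected tokens per draft-verify cycle, plus a rough speedup estimate.
# The 5% draft cost (draft pass time / target pass time) is an assumption.

def expected_tokens(alpha, K):
    return (1 - alpha ** (K + 1)) / (1 - alpha)          # 1 + a + a^2 + ... + a^K

def rough_speedup(alpha, K, draft_cost=0.05):
    cycle_cost = K * draft_cost + 1.0                    # K cheap draft passes + 1 target verify pass
    return expected_tokens(alpha, K) / cycle_cost        # vs. one target pass per token normally

for alpha in (0.8, 0.6, 0.4):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens/cycle, "
          f"~{rough_speedup(alpha, 5):.2f}x speedup")
# alpha=0.8: 3.69 tokens/cycle, ~2.95x ... alpha=0.4: 1.66 tokens/cycle, ~1.33x
```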
When It Fails
High concurrency (high QPS): When the GPU batch is already full of requests, adding draft model computation creates more work without benefit. The GPU is already saturated — there's no spare capacity to exploit. Both vLLM and TensorRT-LLM explicitly warn: speculative decoding is for latency, not throughput.
High-entropy outputs (creative writing, brainstorming): The draft model's guesses diverge from the target's preferences. Acceptance rate drops below ~50%, and speculative decoding becomes slower than standard decoding.
Distribution mismatch: If the draft model was trained on different data or with different RLHF/instruction tuning, its probability distribution diverges structurally from the target. Acceptance rate suffers even on "easy" tokens.
Draft length K too large: With K=10 and 50% acceptance, you draft 10 tokens but only keep ~1-2. The compute spent on the 8 rejected tokens was wasted. K should be tuned empirically — 3-7 is typical.
Acceptance Rate: The Key Diagnostic
The single most important metric is acceptance rate (α). It tells you whether speculative decoding is helping or hurting:
- α > 70% — excellent match, 2-3x speedup likely
- α = 50-70% — moderate, 1.5-2x speedup
- α < 50% — speculative decoding is likely slower than standard. Investigate the mismatch.
On the right panel: Drag the acceptance rate slider and toggle QPS level. Watch how speedup degrades as acceptance rate drops below 50% or QPS increases. At high QPS, the GPU is already saturated — draft tokens add overhead without benefit.