Speculative Decoding — Draft, Verify, Accelerate
What is Speculative Decoding?
Speculative decoding is a technique that makes LLM inference 2-3x faster by using a small "draft" model to guess multiple tokens ahead, then verifying all guesses at once with the large target model. The output is mathematically identical to running the target model alone — no quality trade-off.
Think of it like a restaurant kitchen. The safe approach: wait for each customer to order, then start cooking. The speculative approach: a junior cook starts preparing likely dishes ahead of time. The head chef checks each dish — if it's right, serve it immediately (time saved). If it's wrong, toss it and cook the correct one (small waste, no harm done). Most of the time the guess is right, so the kitchen runs much faster.
In computer science, this is called speculative execution — CPUs have used the same trick for decades. Speculative decoding applies it to LLM token generation: a small "junior" model drafts tokens ahead, and the large "head chef" model verifies them.
The Decode Bottleneck
Recall from the Inference Engine module: LLM generation has two phases. Prefill processes all prompt tokens in parallel — it's compute-bound, and the GPU's math units are saturated. Decode generates tokens one at a time — it's memory-bandwidth-bound, reading all model weights from HBM for each single token.
Here's the problem: a 7B-parameter model at FP16 precision is ~14 GB of weights, and decode reads all of them from memory every single step. The GPU reads 14 GB and produces... one token. Its compute units are mostly idle — waiting for memory.
The Unintuitive Observation
Here's what makes speculative decoding possible: processing N tokens costs nearly the same as processing 1 token.
Imagine a librarian who must walk to a distant warehouse to fetch a heavy reference book (14 GB of model weights). Once she carries it back to the desk, looking up one answer takes a second. But looking up 10 answers from the same book takes barely any extra time — the expensive part was the walk, not the lookup. That's exactly what happens during decode: the GPU spends most of its time loading weights from memory (the walk), not computing (the lookup). Whether it applies those weights to 1 token or 10 tokens, the loading cost is the same ~14 GB. The extra compute for 10 tokens barely registers.
As Andrej Karpathy put it: "Forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch."
This means: if we could somehow give the target model K tokens to verify at once instead of generating them one by one, each forward pass would produce K tokens instead of 1 — for nearly the same wall-clock time.
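To see why, here's a rough roofline-style estimate in Python. The hardware numbers (HBM bandwidth, FP16 throughput) are illustrative assumptions, not any specific GPU's spec:

```python
# Back-of-envelope: why a decode pass over K tokens costs about the same as over 1.
# Hardware numbers below are illustrative assumptions, not a specific GPU's spec.
WEIGHT_BYTES = 14e9      # 7B parameters x 2 bytes (FP16)
HBM_BANDWIDTH = 2e12     # ~2 TB/s
FP16_FLOPS = 300e12      # ~300 TFLOP/s

def forward_pass_time(num_tokens: int) -> float:
    """Roofline-style estimate: the pass takes as long as its slower resource."""
    load_time = WEIGHT_BYTES / HBM_BANDWIDTH             # weights are read once, no matter how many tokens
    compute_time = (2 * 7e9 * num_tokens) / FP16_FLOPS   # ~2 FLOPs per parameter per token
    return max(load_time, compute_time)

print(f"{forward_pass_time(1) * 1e3:.2f} ms")   # ~7.00 ms: dominated by the 14 GB weight read
print(f"{forward_pass_time(10) * 1e3:.2f} ms")  # ~7.00 ms: compute grew 10x but is still tiny
```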
But where do those K tokens come from? That's Step 2.
On the right panel: Watch how each token requires a full forward pass — the GPU reads 14 GB of weights each time but only generates one token. That's the bottleneck speculative decoding attacks.
The Draft-Verify Idea
The K tokens come from a draft model — a much smaller, faster model from the same family. For example, Llama 3.2 1B (the "draft") generates candidate tokens for Llama 3.1 70B (the "target").
How It Works
The loop is simple (a minimal code sketch follows the list):
- Draft: The small model generates K tokens one by one (fast — it's tiny)
- Verify: The target model checks all K draft tokens at once, in a single pass through the model
- Accept/Reject: Starting from the first token, accept each draft token that the target agrees with. Stop at the first disagreement.
- Correct: Replace the rejected token with what the target actually wanted
- Repeat from the corrected position
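Here is that loop as a Python sketch. The models and their methods (next_token, probs_for, accepts, correction) are hypothetical placeholders, not a real library API; the exact accept/correct rule comes in the Verification Algorithm step.

```python
# Minimal sketch of one draft-verify cycle. Model objects and their methods
# are hypothetical placeholders, not a real library API.

def speculative_decode_step(tokens, draft_model, target_model, K=5):
    # 1. Draft: the small model proposes K tokens, one at a time (cheap passes).
    drafted = []
    for _ in range(K):
        drafted.append(draft_model.next_token(tokens + drafted))

    # 2. Verify: ONE target forward pass scores every drafted position at once.
    target_probs = target_model.probs_for(tokens, drafted)

    # 3/4. Accept the longest agreeing prefix; replace the first rejection
    #      with the target's own choice, then discard the rest.
    output = list(tokens)
    for tok, probs in zip(drafted, target_probs):
        if target_model.accepts(tok, probs):            # accept/reject rule (Verification step)
            output.append(tok)
        else:
            output.append(target_model.correction(probs))
            break
    return output                                        # 5. the caller repeats from here
```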
A Concrete Example
Prompt: "The capital of France is"
The draft model (1B) generates 5 tokens quickly: "Par" "is" "." "It" "is"
The target model (70B) verifies all 5 in one pass. How? Both models output a probability (confidence) for each token. The target compares its own confidence to the draft's — simplified here; the exact probabilistic rule comes in the Verification Algorithm step:
- If the target is equally or more confident → the draft got it right → accept
- If the target is less confident → the draft guessed wrong → reject, and use the target's own answer instead
Walking through the example:
- "Par" — draft: 72% confident, target: 85% confident (target agrees even more) → accepted ✓
- "is" — draft: 68%, target: 80% → accepted ✓
- "." — draft: 55%, target: 60% → accepted ✓
- "It" — draft: 45%, target: only 18% (target preferred "The") → rejected ✗ → corrected to "The"
- "is" — after a rejection, all remaining tokens are discarded (they were based on the wrong token)
Result: 4 tokens ("Paris. The") from a single pass of the expensive target model (plus 5 cheap draft passes). Standard decoding would have needed 4 target passes. Even after paying for the draft work, that's roughly a 2x speedup — and the output is identical.
Why Verification is Cheap
A forward pass is one complete run through the model — feeding tokens in, getting a prediction out. It's the basic unit of work: every token the model generates costs one forward pass.
Now, there are two kinds of forward passes:
- Processing multiple tokens at once (called "prefill"): the model reads through all tokens in parallel — like when it first reads your prompt. The GPU is busy doing math. Fast.
- Generating one new token (called "decode"): the model produces one token per pass. As you saw in Step 1, this is slow — the GPU spends most of its time loading weights rather than computing.
The key insight from Step 1: prefill (many tokens) costs almost the same wall-clock time as decode (one token), because the bottleneck is loading weights (same 14 GB either way), not computing.
Verification exploits this: when the target model checks K draft tokens, it processes them all together — the same way it reads a prompt. So verifying 5 tokens takes roughly the same time as generating 1 token. The verification is essentially "free."
The draft model is also cheap to run — it's ~70x smaller, so each of its forward passes is roughly 70x faster than the target's.
On the right panel: Click Play to watch the draft model generate tokens quickly (amber lane), then the target model verify them all at once (indigo lane). Green = accepted, red = rejected and corrected. Try switching examples and adjusting K.
The Verification Algorithm
How does the target model decide which draft tokens to accept? And how does this guarantee identical output? The answer is rejection sampling.
The Accept/Reject Rule
For each draft token x, we compare two probabilities:
- p(x) — the draft model's probability for this token
- q(x) — the target model's probability for this token
The rule: accept token x with probability min(1, q(x) / p(x))
This means:
- If q(x) ≥ p(x) — the target is at least as confident as the draft → always accept
- If q(x) < p(x) — the target is less confident → accept with probability q(x)/p(x), which is a coin flip weighted by how much the target disagrees
When a Token is Rejected
At the first rejection, the draft sequence is truncated. The rejected token is replaced by resampling from the residual distribution:
residual(x) = normalize(max(0, q(x) − p(x)))
This residual captures probability mass where the target model is more confident than the draft — exactly the tokens the draft under-estimated. The resampled token fills the gap.
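Here is the accept rule and the residual resample together as a small NumPy sketch, using a toy four-token vocabulary with made-up probabilities that mirror the "It" vs. "The" moment from the earlier example:

```python
import numpy as np

# Accept/reject + residual resampling for ONE drafted position.
# p_dist / q_dist are the draft's and target's full probability vectors over the
# vocabulary at this position; `drafted` is the token the draft picked.
# Toy vocabulary and made-up numbers, for illustration only.

def verify_token(drafted, p_dist, q_dist, rng=np.random.default_rng()):
    p, q = p_dist[drafted], q_dist[drafted]
    if rng.random() < min(1.0, q / p):               # accept with probability min(1, q/p)
        return drafted, True
    residual = np.maximum(0.0, q_dist - p_dist)      # mass where the target is more confident
    residual /= residual.sum()                       # normalize(max(0, q - p))
    return rng.choice(len(q_dist), p=residual), False

vocab = ["The", "It", "Par", "is"]
p_dist = np.array([0.30, 0.45, 0.15, 0.10])          # draft prefers "It" (45%)
q_dist = np.array([0.60, 0.18, 0.12, 0.10])          # target prefers "The" (60%)
token, accepted = verify_token(1, p_dist, q_dist)    # drafted token 1 = "It"
print(vocab[token], accepted)                        # usually: "The" False (rejected, corrected)
```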
The Mathematical Guarantee
This isn't an approximation. Leviathan et al. (ICML 2023) proved that speculative sampling produces output that is mathematically identical to sampling from the target model alone (Theorem 1).
Every output token follows exactly the target model's distribution q. The draft model only affects speed (how many tokens are accepted per loop), never quality.
Acceptance Rate
In practice, acceptance rates range from 53% to 75% per token, depending on:
- How well-matched the draft and target models are (same family = better)
- Task predictability (factual Q&A = high acceptance, creative writing = low)
- Draft length K (longer drafts have lower per-token acceptance at later positions)
On the right panel: Step through each draft token's verification with Next Token →. Then scroll to the Sandbox below: pick a token, drag the q(x) slider, and watch the accept rule min(1, q/p) snap from "always accept" to a weighted coin flip the moment q dips below p. The draft-excess strip (p − q) shows the draft over-confidence that drives rejection. The actual residual distribution that gets resampled lives at the other tokens at this position; it's visualized in the residual chart that appears when you step to a true rejection above.
Draft Model Variants
So far we've used a separate small model as the draft. But there are other ways to generate draft tokens — each with different tradeoffs in memory, speed, and applicability.
Independent Drafting — Separate Small Model
The classic approach from the original papers (Leviathan et al. and Chen et al., both 2023). Use a smaller model from the same family:
- Example: Llama 3.2 1B drafting for Llama 3.1 70B
- Speedup: 2-3x at low QPS
- Pros: Highest acceptance rate (trained on same data, same tokenizer)
- Cons: Must load two models in GPU memory (1B is small, but it adds up at scale)
This is the most straightforward and well-understood approach.
Self-Drafting — Heads on the Target Model
The independent approach requires loading a whole second model. What if we could make the target model draft its own tokens? That's the idea behind self-drafting: attach small, cheap "extra heads" to the target model that guess future tokens. Think of it like giving the model extra eyes that look ahead.
Medusa (named after the mythological Medusa, whose head sprouted many snakes) attaches K small prediction heads to the target model. Each head looks at the model's final hidden state and guesses a different future position — head 1 guesses the next token, head 2 guesses the token after that, and so on. The base model stays frozen; only the small heads are trained. Achieves 2.2-3.6x speedup.
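To make the "extra heads" idea concrete, here is a minimal PyTorch-style sketch in the spirit of Medusa. The sizes, the two-layer head design, and the class name are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Sketch of self-drafting heads in the spirit of Medusa. Each small head maps the
# frozen target model's last hidden state to logits for a different future
# position (head 1 -> next token, head 2 -> the one after, ...). Only the heads
# are trained. Sizes and the two-layer design are illustrative assumptions.

class DraftHeads(nn.Module):
    def __init__(self, hidden_size=4096, vocab_size=32000, num_heads=4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(),
                          nn.Linear(hidden_size, vocab_size))
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden):                    # last_hidden: [batch, hidden_size]
        return [head(last_hidden) for head in self.heads]  # one [batch, vocab] guess per position

heads = DraftHeads()
guesses = heads(torch.randn(1, 4096))                  # 4 drafted positions from one hidden state
```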
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) takes a different approach: instead of looking at just the final output, it looks deeper — at the model's second-to-last layer, which contains richer information. This is like reading someone's draft notes (detailed) instead of their final summary (compressed). Because the features are richer, EAGLE makes better guesses — ~3x speedup, about 1.6x faster than Medusa.
Layer skipping is the simplest self-drafting approach: run tokens through only half (or a quarter) of the target model's layers as a quick draft. No extra parameters are needed at all — but the guesses are rougher, since fewer layers mean less reasoning.
Free Drafting — No Model at All
Prompt lookup decoding builds an n-gram index from the prompt itself. When the last 2-3 generated tokens match a sequence in the prompt, it predicts that the next tokens will continue that match (sketched in code after the tradeoffs below).
- Speedup: Up to 2.8x on summarization (CNN/DailyMail)
- Pros: Zero compute cost for drafting — no model, no GPU, no memory
- Cons: Only works when the output closely mirrors the input (summarization, RAG, Q&A where the model quotes from context)
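A minimal sketch of the lookup idea, using whole words as stand-in "tokens". The function name and details are illustrative; real implementations handle multiple matches and length caps more carefully:

```python
# Minimal sketch of prompt-lookup drafting, using whole words as stand-in "tokens".
# If the last few generated tokens also appear in the prompt, propose the tokens
# that followed them there. Illustrative only.

def prompt_lookup_draft(prompt_tokens, generated_tokens, ngram=3, max_draft=5):
    if len(generated_tokens) < ngram:
        return []                                           # not enough context to match yet
    key = tuple(generated_tokens[-ngram:])
    for i in range(len(prompt_tokens) - ngram):
        if tuple(prompt_tokens[i:i + ngram]) == key:
            start = i + ngram
            return prompt_tokens[start:start + max_draft]   # continue the matched span
    return []                                               # no match -> no free draft this step

prompt = "the quick brown fox jumps over the lazy dog".split()
generated = "it says the quick brown".split()
print(prompt_lookup_draft(prompt, generated))               # ['fox', 'jumps', 'over', 'the', 'lazy']
```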
On the right panel: Toggle between the three approaches. Notice how Independent uses a separate model (more memory), Self-Drafting adds heads to the target (less memory, requires fine-tuning), and Prompt Lookup needs no model at all (free but limited to copy-heavy tasks). The comparison table summarizes the tradeoffs.
When to Use Speculative Decoding
Speculative decoding isn't a universal win. It has a clear sweet spot — and clear failure modes.
Real-World Speedup Numbers
| Approach | Speedup | Setup |
|---|---|---|
| Classical (Llama 1B → 70B) | 2-3x | Low QPS |
| Medusa | 2.2-3.6x | Fine-tuned heads |
| EAGLE | ~3x | Fine-tuned draft head |
| Prompt Lookup | Up to 2.8x | Summarization only |
| TensorRT-LLM (Llama 405B, 4× H200) | 3.6x | FP8 quantization |
The Sweet Spot
Speculative decoding works best for latency-sensitive, low-concurrency serving:
- Chatbots and coding assistants (interactive, low QPS per user)
- Search and retrieval (fast response expected)
- Single-user inference (local deployment, personal assistants)
The expected number of tokens per draft–verify cycle follows a geometric series. With per-token acceptance rate α and draft length K:

Expected tokens per cycle = 1 + α + α² + … + α^K = (1 − α^(K+1)) / (1 − α)

Every cycle produces at least one token (the leading 1); each α^i term adds the probability that the first i draft tokens all survive verification.
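Plugging in numbers shows how quickly the payoff decays as α drops. The 5% draft-cost ratio below is an assumed figure, roughly in the spirit of a 1B draft serving a 70B target:

```python
# Expected tokens per draft-verify cycle, plus a rough speedup estimate.
# The 5% draft cost (draft pass time / target pass time) is an assumption.

def expected_tokens(alpha, K):
    return (1 - alpha ** (K + 1)) / (1 - alpha)          # 1 + a + a^2 + ... + a^K

def rough_speedup(alpha, K, draft_cost=0.05):
    cycle_cost = K * draft_cost + 1.0                    # K cheap draft passes + 1 target verify pass
    return expected_tokens(alpha, K) / cycle_cost        # vs. one target pass per token normally

for alpha in (0.8, 0.6, 0.4):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens/cycle, "
          f"~{rough_speedup(alpha, 5):.2f}x speedup")
# alpha=0.8: 3.69 tokens/cycle, ~2.95x ... alpha=0.4: 1.66 tokens/cycle, ~1.33x
```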
When It Fails
High concurrency (high QPS): When the GPU batch is already full of requests, adding draft model computation creates more work without benefit. The GPU is already saturated — there's no spare capacity to exploit. Both vLLM and TensorRT-LLM explicitly warn: speculative decoding is for latency, not throughput.
High-entropy outputs (creative writing, brainstorming): The draft model's guesses diverge from the target's preferences. Acceptance rate drops below ~50%, and speculative decoding becomes slower than standard decoding.
Distribution mismatch: If the draft model was trained on different data or with different RLHF/instruction tuning, its probability distribution diverges structurally from the target. Acceptance rate suffers even on "easy" tokens.
Draft length K too large: With K=10 and 50% acceptance, you draft 10 tokens but only keep ~1-2. The compute spent on the 8 rejected tokens was wasted. K should be tuned empirically — 3-7 is typical.
Acceptance Rate: The Key Diagnostic
The single most important metric is acceptance rate (α). It tells you whether speculative decoding is helping or hurting:
- α > 70% — excellent match, 2-3x speedup likely
- α = 50-70% — moderate, 1.5-2x speedup
- α < 50% — speculative decoding is likely slower than standard. Investigate the mismatch.
On the right panel: Drag the acceptance rate slider and toggle QPS level. Watch how speedup degrades as acceptance rate drops below 50% or QPS increases. At high QPS, the GPU is already saturated — draft tokens add overhead without benefit.