What is JetSpec's parallel tree drafting?

JetSpec is a speculative-decoding framework that trains a single causal draft head to emit a whole tree of candidate next-tokens in one forward pass, instead of a single linear chain of guesses. Each branch of the tree is conditioned on its own prefix, so every path is a real continuation built to align with the target model's token factorization, per the paper. The target model then verifies the tree and keeps the longest branch that matches what it would have generated, so a bigger draft budget produces a wider tree and longer accepted runs rather than more wasted tokens.

Why does drafting a tree beat drafting a single line?

In plain speculative decoding the drafter proposes one straight line of tokens, and verification throws away every guess after the first position where the draft diverges from the target. So pouring more budget into a longer line mostly produces tokens past the first mistake that can never be accepted — the draft-budget scaling ceiling. A tree keeps several candidate tokens at each position, so when one branch diverges the verifier can follow a sibling branch that matches, extending the accepted run. The same compute per pass yields a longer accepted sequence, which is what the decode loop converts into wall-clock speedup.

How does JetSpec compare to Medusa or EAGLE?

Medusa and EAGLE also draft trees rather than single lines, but their candidate heads tend to be largely independent and the tree shape is usually preset. JetSpec's contribution, per the paper, is a single causal draft head whose branches are each conditioned on their own prefix over the target's fused hidden states, so the proposed tree is built to align with the target's own token factorization and a larger draft budget converts cleanly into longer accepted sequences. On H100 GPUs the paper reports up to 9.64× speedup on MATH-500 and 4.58× on conversational workloads.

JetSpec speeds speculative decoding up to 9.64× — Parallel tree drafting

Jargon

Speculative decoding: A serving trick where a small fast drafter guesses several tokens ahead and the big target model verifies the whole guess in one forward pass, keeping the accepted prefix.
Draft head: JetSpec's drafter is not a separate model but a small head bolted onto the target's hidden states — one extra forward of that head emits the candidate tokens.
Draft budget: How many candidate tokens the drafter is allowed to propose before a verification pass. The central question is what a bigger budget actually buys you.
Token tree: Instead of one linear sequence of guesses, a branching set of candidate continuations — at each position several next-tokens are kept, so one tree holds many possible lines at once.
Branch-wise causal conditioning: Each branch of the tree is generated conditioned on its own prefix — the tokens above it on the branch — so every path is a real continuation, not a flat list of independent top-k guesses.
Acceptance length: The average number of drafted tokens the verifier keeps per pass. Tree drafting raises it by giving the verifier alternatives to walk when one branch diverges.
Draft-budget scaling ceiling: The point where spending more on the draft stops helping — on a single line, extra guesses past the first mistake are pure waste. JetSpec is built to push that ceiling up.
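
One common way to express branch-wise conditioning in tree-based drafting is an ancestor-only visibility mask: each node in the tree may attend to itself and to the tokens above it on its own branch, and to nothing on sibling branches. The sketch below is a generic illustration of that idea, not code from the JetSpec paper. The `parents` layout and the function name `ancestor_mask` are assumptions made for this example.

```python
def ancestor_mask(parents):
    """Return mask[i][j] == True when node j is node i itself or one of its ancestors.

    `parents[i]` is the index of node i's parent, or -1 for a node that hangs
    directly off the already-accepted context. A parent must always be listed
    before its children (hypothetical layout chosen for this sketch).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        if p >= 0:
            # Inherit everything the parent can see ...
            mask[i] = list(mask[p])
        # ... and let the node see itself.
        mask[i][i] = True
    return mask


# Hypothetical 5-node tree: nodes 0 and 1 are two first-position candidates,
# node 0 has children 2 and 3, node 1 has child 4.
parents = [-1, -1, 0, 0, 1]
mask = ancestor_mask(parents)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Each printed row shows what one node can see. Node 4 sees only node 1 and itself, so the branch through node 0 never leaks into its prediction.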

The news. On June 25, 2026, UC San Diego's Hao AI Lab posted JetSpec (arXiv:2606.18394). It is a speculative-decoding framework that trains a single causal draft head to emit an entire tree of candidate continuations in one forward pass. Each branch is causally conditioned on its own prefix over what the paper calls the target model's fused hidden states, so the proposed tree is built to match the target's own token factorization. The result: a larger draft budget converts into longer accepted token sequences rather than more rejected drafts. On H100 GPUs the authors report up to 9.64× speedup on MATH-500 and 4.58× on conversational workloads.

Picture a GPS that, instead of plotting one route, maps a whole fan of routes at every junction before you pull out of the driveway. With a single fixed route, one missed turn means the rest of the directions are useless and the GPS has to recompute from scratch. With the fan already mapped, a wrong turn just drops you onto a neighbouring branch that was prepared in the same planning pass — you keep driving, and far more of your trip was charted in advance. That swap, from one route to a fan of routes mapped in one pass, is exactly what JetSpec does to the drafter in speculative decoding.

Underneath the metaphor, the bottleneck is the way a normal drafter guesses. It proposes one straight line of tokens (token 1, then token 2 conditioned on token 1, and so on), and the target model verifies the line in a single pass. But verification is unforgiving: the first position where the draft diverges from what the target would have produced throws away that guess and every guess after it. A drafter that proposes ten tokens but goes wrong at the third contributes only two accepted tokens; the other eight (the wrong guess and the seven after it) were computed and discarded. This is the draft-budget scaling ceiling: pour more budget into a single line and most of it lands past the first mistake, where it cannot be used.
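
The ten-token example can be checked with a minimal sketch. It assumes greedy, exact-match verification and treats the target's output as a fixed reference sequence, which is a simplification of a real verifier. The token strings and the helper name `accepted_prefix_length` are made up for illustration.

```python
def accepted_prefix_length(draft, target):
    """Count drafted tokens kept under exact-match (greedy) verification."""
    kept = 0
    for guess, truth in zip(draft, target):
        if guess != truth:
            break  # first divergence: this guess and everything after it is dropped
        kept += 1
    return kept


# Illustrative reference continuation and a ten-token draft that goes wrong
# at the third position.
target = [f"t{i}" for i in range(1, 11)]
draft = list(target)
draft[2] = "wrong"

kept = accepted_prefix_length(draft, target)
wasted = len(draft) - kept
print(kept, wasted)  # 2 8
```

Note that the seven guesses after the mistake match the reference here and are still discarded, because they were conditioned on a token the target never accepted.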

JetSpec breaks the line into a tree. One causal draft head, in a single forward pass, emits not one continuation but a branching set of them: at each position it keeps several candidate next-tokens, and crucially each branch is conditioned on its own prefix, so every path down the tree is a genuine continuation rather than a flat bag of independent guesses. The target model then walks the tree and keeps the longest branch that matches what it would have generated. Because the divergence point now has alternatives waiting, a wrong turn on one branch no longer wastes the whole budget — a sibling branch can carry the accepted run further. A bigger draft budget becomes a wider, deeper tree, which the verifier turns into a longer accepted sequence, not a longer pile of rejects.
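
The tree walk described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's verifier: the nested-dict tree, the `target_next` callback standing in for the target model's greedy choice, and the toy sentence are all hypothetical. A real system would score every node of the tree in one target forward pass rather than calling a function per step.

```python
def longest_accepted_path(tree, target_next):
    """Follow, at each depth, the drafted child that equals the target's next token.

    `tree` maps each candidate token to the subtree drafted beneath it.
    `target_next(prefix)` returns the token the target would emit after `prefix`.
    """
    accepted = []
    node = tree
    while node:
        wanted = target_next(tuple(accepted))
        if wanted not in node:
            break  # no drafted branch matches: the accepted run ends here
        accepted.append(wanted)
        node = node[wanted]
    return accepted


# Toy stand-in for the target model: it always continues a fixed sentence.
sentence = ["the", "cat", "sat", "down"]

def target_next(prefix):
    return sentence[len(prefix)] if len(prefix) < len(sentence) else None


# Hypothetical draft tree with two candidates at most positions.
draft_tree = {
    "a": {"dog": {}},
    "the": {
        "dog": {},
        "cat": {"sat": {}, "ran": {}},
    },
}

print(longest_accepted_path(draft_tree, target_next))  # ['the', 'cat', 'sat']
```

A linear draft that happened to pick "a" or "dog" would have stopped early; the tree keeps going because the matching sibling was already drafted.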

[Figure: one speculative-decoding step. A 1B draft model proposes K=5 tokens ("Par", "is", ".", "It", "is"). The 70B target model verifies them in one forward pass: "Par", "is" and "." are accepted, "It" is rejected and corrected to "The", and the trailing "is" is discarded. 3 accepted + 1 corrected = 4 tokens from 2 forward passes; the loop repeats until generation is done.]

Where it earns its keep

Picture a draft budget of 8 tokens (illustrative — the paper's real gains are the H100 numbers below). Spent as one line, suppose the target diverges at position 3: you keep 2 tokens and discard the remaining 6, for an acceptance length of 2 out of 8 spent. Now spend the same 8-token budget as a tree — say a root that branches two ways at each of the first few positions. At the spot where the single line went wrong, the tree already holds the other candidate token on a sibling branch, so the verifier follows that branch instead and the accepted run reaches position 5 before any branch runs out — an acceptance length of 5 from the same 8-token budget. In this toy example the same draft budget accepts 2.5× more tokens — and that multiplier is the whole game, because acceptance length is exactly what the decode loop turns into wall-clock speedup. The paper's reported up to 9.64× on MATH-500 is this effect compounded across a real model and a real draft tree.
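
The toy numbers above can be reproduced with a short sketch. Everything in it is illustrative: the single-letter tokens, the particular 8-node tree shape, and the helper `tree_acceptance` are assumptions chosen to match the example, and verification is again simplified to exact matching against a fixed reference.

```python
def tree_acceptance(nodes, target):
    """Depth of the deepest path from the root whose tokens all match `target`.

    `nodes` is a list of (token, parent_index) pairs with parents listed first;
    parent_index is -1 for first-position candidates.
    """
    matched_depth = []  # per node: depth of its matched path, or 0 if diverged
    best = 0
    for token, parent in nodes:
        parent_depth = matched_depth[parent] if parent >= 0 else 0
        parent_ok = parent < 0 or parent_depth > 0
        pos = parent_depth  # 0-based position this node would fill
        ok = parent_ok and pos < len(target) and token == target[pos]
        matched_depth.append(parent_depth + 1 if ok else 0)
        best = max(best, matched_depth[-1])
    return best


target = list("ABCDEFGH")

# Budget of 8 spent as one line that goes wrong at position 3.
line = ["A", "B", "x", "D", "E", "F", "G", "H"]
linear_nodes = [(tok, i - 1) for i, tok in enumerate(line)]

# Same budget of 8 spent as a tree: siblings at positions 1, 3 and 4.
tree_nodes = [
    ("A", -1), ("z", -1),   # position 1: two candidates
    ("B", 0),               # position 2
    ("x", 2), ("C", 2),     # position 3: the wrong guess and its sibling
    ("D", 4), ("y", 4),     # position 4
    ("E", 5),               # position 5
]

linear_len = tree_acceptance(linear_nodes, target)
tree_len = tree_acceptance(tree_nodes, target)
print(linear_len, tree_len, tree_len / linear_len)  # 2 5 2.5
```

Both drafts spend exactly eight tokens. The only difference is shape, and the sibling "C" at position 3 is what lets the accepted run continue to position 5.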

Draft shape	What it proposes	Branches conditioned on prefix?	What a bigger budget buys
Linear draft (vanilla spec-decode)	one straight line of tokens	n/a — single path	mostly wasted tokens past the first miss
Static draft trees (Medusa / EAGLE-style)	a fixed tree of candidates	partially — heads tend to be largely independent	more coverage, but a mostly preset tree shape
JetSpec parallel tree drafting	a causal tree from one draft head, in one pass	yes — each branch follows from its own prefix	longer accepted runs (up to 9.64× on MATH-500, H100)

Because the change lives in how the draft is shaped rather than in the target model's weights, it composes with the rest of the toolkit — a better-trained draft head, a smarter verification rule, or a load-aware deployment all still apply, and the production gain will depend on the model family and how predictable the text is. The headline is not a new model; it is a better-shaped guess — one that finally makes a bigger draft budget pay off.


Related explainers

Spec-decode latency paper — Load-dependent latency model — the other axis: why even a great drafter's speedup decays as the server saturates
PPOW — window-level RL for speculative drafters — trains the drafter to miss less; JetSpec instead reshapes the draft so a miss costs less
VIA-SD — Tiered confidence-gated verification — rethinks the verify side rather than the draft side of the same loop
