The news. On June 25, 2026, UC San Diego's Hao AI Lab posted JetSpec (arXiv:2606.18394). It is a speculative-decoding framework that trains a single causal draft head to emit an entire tree of candidate continuations in one forward pass, with each branch causally conditioned on its own prefix over what the paper calls the target model's fused hidden states, so the proposed tree is built to match the target's own token factorization. The result: a larger draft budget converts into longer accepted token sequences rather than more rejected drafts. On H100 GPUs the authors report up to 9.64× speedup on MATH-500 and 4.58× on conversational workloads.
Picture a GPS that, instead of plotting one route, maps a whole fan of routes at every junction before you pull out of the driveway. With a single fixed route, one missed turn means the rest of the directions are useless and the GPS has to recompute from scratch. With the fan already mapped, a wrong turn just drops you onto a neighbouring branch that was prepared in the same planning pass — you keep driving, and far more of your trip was charted in advance. That swap, from one route to a fan of routes mapped in one pass, is exactly what JetSpec does to the drafter in speculative decoding.
Underneath the metaphor, the bottleneck is the way a normal drafter guesses. It proposes one straight line of tokens (token 1, then token 2 conditioned on token 1, and so on), and the target model verifies the line in a single pass. But verification is unforgiving: the first position where the draft diverges from what the target would have produced throws away that guess and every guess after it. A drafter that proposes ten tokens but goes wrong at the third contributes only two accepted tokens; the other eight were computed and discarded. This is the draft-budget scaling ceiling: pour more budget into a single line and most of it lands past the first mistake, where it cannot be used.
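The ten-token case above can be checked with a minimal sketch. It assumes greedy exact-match verification (a sampling-based acceptance rule would behave differently), and the function name `accepted_prefix_length` and the single-letter tokens are hypothetical, chosen only for illustration.

```python
from typing import List


def accepted_prefix_length(draft: List[str], target: List[str]) -> int:
    """Count draft tokens accepted before the first mismatch with the target.

    Assumption: greedy exact-match verification, not the paper's actual rule.
    """
    accepted = 0
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            break  # everything from this position on is discarded
        accepted += 1
    return accepted


# Hypothetical tokens: a ten-token draft that goes wrong at the third position.
target = list("abcdefghij")
draft = list("abXdefghij")

kept = accepted_prefix_length(draft, target)
wasted = len(draft) - kept
print(kept, wasted)  # 2 8
```

Note that positions 4 through 10 of the draft happen to agree with the target here, yet they are still thrown away: they were drafted on top of a wrong third token, so the verifier cannot use them.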
JetSpec breaks the line into a tree. One causal draft head, in a single forward pass, emits not one continuation but a branching set of them: at each position it keeps several candidate next-tokens, and crucially each branch is conditioned on its own prefix, so every path down the tree is a genuine continuation rather than a flat bag of independent guesses. The target model then walks the tree and keeps the longest branch that matches what it would have generated. Because the divergence point now has alternatives waiting, a wrong turn on one branch no longer wastes the whole budget — a sibling branch can carry the accepted run further. A bigger draft budget becomes a wider, deeper tree, which the verifier turns into a longer accepted sequence, not a longer pile of rejects.
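The tree walk described above can be sketched in a few lines. This is an illustration under stated assumptions, not JetSpec's implementation: the draft tree is a plain nested dictionary, the target is simulated by a hypothetical `target_next` function that returns the token it would generate after a given prefix, and verification is greedy exact match. A real system would score every tree node in one batched target pass rather than calling the target step by step.

```python
from typing import Callable, Dict, List, Tuple

# A draft tree as nested dicts: each key is a candidate token,
# each value is the subtree of continuations conditioned on that token.
Tree = Dict[str, "Tree"]


def verify_tree(tree: Tree, target_next: Callable[[Tuple[str, ...]], str]) -> List[str]:
    """Follow the one branch that matches the target at every step."""
    accepted: List[str] = []
    node = tree
    while node:  # stop when the current branch has no more drafted children
        wanted = target_next(tuple(accepted))
        if wanted not in node:
            break  # no sibling holds the target's token: the accepted run ends
        accepted.append(wanted)
        node = node[wanted]
    return accepted


# Hypothetical target that would generate this fixed sequence.
reference = ["the", "cat", "sat", "on", "a", "mat"]


def target_next(prefix: Tuple[str, ...]) -> str:
    return reference[len(prefix)] if len(prefix) < len(reference) else "<eos>"


# Hypothetical draft tree: "dog" and "is" are wrong turns with siblings waiting.
draft_tree: Tree = {
    "the": {
        "dog": {"ran": {}},
        "cat": {"sat": {"on": {}}, "is": {}},
    }
}

accepted = verify_tree(draft_tree, target_next)
print(accepted)  # ['the', 'cat', 'sat', 'on']
```

Because every child sits under the prefix that produced it, a wrong candidate such as "dog" costs only its own subtree; the sibling "cat" carries the accepted run forward.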
Where it earns its keep
Picture a draft budget of 8 tokens (illustrative; the paper's real gains are the reported H100 numbers). Spent as one line, suppose the draft diverges from the target at position 3: you keep 2 tokens and discard the remaining 6, for an acceptance length of 2 out of 8 spent. Now spend the same 8-token budget as a tree, say a root that branches two ways at each of the first few positions. At the spot where the single line went wrong, the tree already holds the other candidate token on a sibling branch, so the verifier follows that branch instead and the accepted run reaches position 5 before the branch runs out: an acceptance length of 5 from the same 8-token budget. In this toy example the same draft budget yields 2.5× as many accepted tokens (5 versus 2), and that multiplier is the whole game, because acceptance length is what the decode loop turns into wall-clock speedup. The paper's reported up to 9.64× on MATH-500 is this effect compounded across a real model and a real draft tree.
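The toy numbers can be reproduced with a short sketch. Everything here is illustrative: the tokens, the tree shape, and the helper names (`count_nodes`, `accepted_length`, `line_to_tree`) are assumptions, and verification is again simplified to greedy exact match. One tree shape that fits the description spends its 8 nodes as two candidates at each of positions 1 to 3, then a single continuation at positions 4 and 5.

```python
from typing import Dict, List

Tree = Dict[str, "Tree"]  # token -> subtree of continuations


def count_nodes(tree: Tree) -> int:
    """Draft budget spent: one unit per drafted token in the tree."""
    return sum(1 + count_nodes(child) for child in tree.values())


def accepted_length(tree: Tree, reference: List[str]) -> int:
    """Length of the path through the tree that matches the reference."""
    node, length = tree, 0
    for token in reference:
        if token not in node:
            break
        node = node[token]
        length += 1
    return length


def line_to_tree(tokens: List[str]) -> Tree:
    """A linear draft is a tree with exactly one child per node."""
    root: Tree = {}
    node = root
    for token in tokens:
        node[token] = {}
        node = node[token]
    return root


reference = list("abcdefgh")  # what the target would generate (hypothetical)

# Budget of 8 as one line that goes wrong at position 3.
line = line_to_tree(list("abXdefgh"))

# Budget of 8 as a tree: two candidates at positions 1-3, then one path.
tree: Tree = {
    "a": {"b": {"X": {}, "c": {"d": {"e": {}}}}, "p": {}},
    "q": {},
}

line_len = accepted_length(line, reference)
tree_len = accepted_length(tree, reference)
print(count_nodes(line), count_nodes(tree))  # 8 8
print(line_len, tree_len, tree_len / line_len)  # 2 5 2.5
```

The trade is visible in the shape: the tree gives up depth (it only reaches position 5, not 8) to buy alternatives at the early positions, where a miss is most expensive.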
| Draft shape | What it proposes | Branches conditioned on prefix? | What a bigger budget buys |
|---|---|---|---|
| Linear draft (vanilla spec-decode) | one straight line of tokens | n/a — single path | mostly wasted tokens past the first miss |
| Static draft trees (Medusa / EAGLE-style) | a fixed tree of candidates | partially — heads tend to be largely independent | more coverage, but a mostly preset tree shape |
| JetSpec parallel tree drafting | a causal tree from one draft head, in one pass | yes — each branch follows from its own prefix | longer accepted runs (up to 9.64× on MATH-500, H100) |
Because the change lives in how the draft is shaped rather than in the target model's weights, it composes with the rest of the toolkit — a better-trained draft head, a smarter verification rule, or a load-aware deployment all still apply, and the production gain will depend on the model family and how predictable the text is. The headline is not a new model; it is a better-shaped guess — one that finally makes a bigger draft budget pay off.
Continue in track: Speculative Decoding, the draft-verify loop JetSpec reshapes

Related explainers
- Spec-decode latency paper — Load-dependent latency model — the other axis: why even a great drafter's speedup decays as the server saturates
- PPOW — window-level RL for speculative drafters — trains the drafter to miss less; JetSpec instead reshapes the draft so a miss costs less
- VIA-SD — Tiered confidence-gated verification — rethinks the verify side rather than the draft side of the same loop