The news. On June 10, 2026, researchers posted VIA-SD (arXiv:2606.12243). It restructures speculative decoding into a multi-tier verifier: high-confidence draft tokens are accepted directly, medium-confidence tokens are re-checked by a slim verifier carved from the full model via intra-model routing, and only the genuinely uncertain tokens trigger a full-model verification. The authors report a 0.10–0.22 absolute drop in the rejection rate, a 10–20% speedup over speculative-decoding baselines, and 2.5–3× over plain, non-drafting decoding.

Picture one grader handed a stack of answers that a fast student raced ahead and filled in. The obvious-right ones, the grader checks off at a glance — no work. The hopeless ones get the full line-by-line regrade. But many of those answers sit in between: they look roughly right and only need a second glance. The waste isn't in the easy answers or the hard ones — it's in sending every middling answer to the full regrade when a quick skim would have settled it. That stack is the drafter's proposed tokens; the grader is the verifier; and "full regrade for everything uncertain" is exactly how plain speculative decoding behaves.

Underneath the metaphor, the bottleneck is the verification step. Standard speculative decoding is binary: each drafted token is either accepted, or it is rejected and the full target model re-runs to produce the correct token. That full pass is the single most expensive operation in the decode loop, and every rejected token pays for one — even the near-misses that the model was almost sure about anyway. The drafter's job is to keep the rejection rate low, but no drafter is perfect, so the loop keeps cashing in expensive full passes on tokens that barely missed.

VIA-SD breaks the binary into three tiers gated by confidence. A high-confidence draft token is accepted outright. A medium-confidence token goes not to the full model but to a slim verifier, and the load-bearing trick is that this slim verifier is a routed sub-network of the full model itself, sharing its weights, so it adds no second model to GPU memory. Only the genuinely low-confidence tokens fall through to a full-model verification. The middle tier is the whole idea: it intercepts the near-misses that used to trigger a full pass and settles them with a much cheaper one.
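The three-tier gate can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two thresholds, the `Tier` names, and the helpers `route_token` and `route_window` are all hypothetical, and the source does not say how VIA-SD computes or calibrates its confidence score.

```python
from enum import Enum


class Tier(Enum):
    ACCEPT = "accept"        # high confidence: keep the draft token as-is
    SLIM_VERIFY = "slim"     # medium confidence: re-check with the slim sub-network
    FULL_VERIFY = "full"     # low confidence: fall through to the full model


# Hypothetical cut-offs chosen for illustration; they are not from the paper.
HIGH_CONFIDENCE = 0.80
LOW_CONFIDENCE = 0.40


def route_token(confidence: float,
                high: float = HIGH_CONFIDENCE,
                low: float = LOW_CONFIDENCE) -> Tier:
    """Pick a verification tier from a confidence score in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= high:
        return Tier.ACCEPT
    if confidence >= low:
        return Tier.SLIM_VERIFY
    return Tier.FULL_VERIFY


def route_window(confidences: list[float]) -> dict[Tier, int]:
    """Count how many tokens of a draft window land in each tier."""
    counts = {tier: 0 for tier in Tier}
    for confidence in confidences:
        counts[route_token(confidence)] += 1
    return counts


if __name__ == "__main__":
    # Made-up confidence scores for a five-token draft window.
    for tier, count in route_window([0.95, 0.85, 0.60, 0.45, 0.20]).items():
        print(f"{tier.value}: {count}")
```

The point of the sketch is the shape of the decision: only tokens that fall below the lower cut-off ever reach the expensive branch, so the middle band is where the savings come from.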

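Token-by-token acceptance check

| Token | p(x) draft | q(x) target | Check |
|---|---|---|---|
| Par | 72% | 85% | q ≥ p ✓ |
| is | 68% | 80% | q ≥ p ✓ |
| . | 55% | 60% | q ≥ p ✓ |
| It | 45% | 18% | q < p ✗ |

Residual distribution for rejected "It" (sample the replacement from it):

| Candidate | Residual mass |
|---|---|
| The | 42% |
| It | 0% |
| Paris | 22% |
| This | 15% |
| A | 10% |

→ sample "The" (highest residual mass)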
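The acceptance check illustrated above can be written out as a short sketch of the standard speculative-sampling rule, using the same notation (p for the drafter, q for the target). One caveat: the illustration marks a token as rejected whenever q < p, whereas the standard rule keeps it with probability q/p and only otherwise resamples from the normalised residual max(0, q − p). The function names and the toy three-token distributions below are assumptions made for the example, and this is the baseline check, not VIA-SD's tiered verifier.

```python
import random


def accept_probability(p_draft: float, q_target: float) -> float:
    """Chance of keeping a drafted token: min(1, q / p)."""
    if p_draft <= 0.0:
        raise ValueError("the drafter must give its own token nonzero probability")
    return min(1.0, q_target / p_draft)


def residual_distribution(p_draft: dict[str, float],
                          q_target: dict[str, float]) -> dict[str, float]:
    """Normalised max(0, q - p): mass the target has that the drafter lacked."""
    raw = {tok: max(0.0, q - p_draft.get(tok, 0.0)) for tok, q in q_target.items()}
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("residual is empty: draft and target agree exactly")
    return {tok: mass / total for tok, mass in raw.items()}


def verify_token(token: str,
                 p_draft: dict[str, float],
                 q_target: dict[str, float],
                 rng: random.Random) -> tuple[str, bool]:
    """Return (token, True) if accepted, else (replacement, False)."""
    keep = accept_probability(p_draft[token], q_target.get(token, 0.0))
    if rng.random() < keep:
        return token, True
    residual = residual_distribution(p_draft, q_target)
    candidates = list(residual)
    weights = [residual[tok] for tok in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0], False


if __name__ == "__main__":
    # Toy distributions over a three-token vocabulary (hypothetical values);
    # only the "It" entries, 45% and 18%, come from the illustration above.
    p = {"It": 0.45, "The": 0.30, "Paris": 0.25}
    q = {"It": 0.18, "The": 0.52, "Paris": 0.30}
    print(accept_probability(p["It"], q["It"]))
    print(residual_distribution(p, q))
    print(verify_token("It", p, q, random.Random(0)))
```

Because a rejected token always has q < p, its own residual mass is zero, which is why "It" shows 0% in the residual table: a rejected token can never be resampled as its own replacement.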

Where it earns its keep

Picture a draft window where, under plain speculative decoding, the verifier rejects 3 of every 10 proposed tokens: a 0.30 rejection rate (illustrative). Each of those 3 rejects costs a full-model verification pass, the priciest step in the loop. VIA-SD's confidence gate diverts the near-misses, the medium-confidence tokens, to the slim sub-network, which runs at a fraction of the full model's cost and shares its weights, so memory is unchanged. The paper reports the rejection rate falling by an absolute 0.10 to 0.22; take the midpoint of that band, 0.16, and the 0.30 rate drops to 0.14. In this illustrative case, roughly half the full-model passes disappear, replaced by far cheaper slim passes. That is where the reported 10–20% speedup over speculative-decoding baselines comes from. The headline 2.5–3× over plain, non-drafting decoding is mostly the speedup speculative decoding already delivered; VIA-SD's own contribution is the incremental 10–20% on top, bought by paying full price far less often.
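The arithmetic in this example is easy to check with a short script. It is a back-of-the-envelope sketch under stated assumptions: the 0.30 baseline is the illustrative figure from the paragraph above, the 0.10 to 0.22 band is the paper's reported drop, and `slim_cost_ratio` (the cost of one slim pass relative to one full pass) is a hypothetical parameter, since the source gives no figure for it. It also assumes every token removed from the rejection rate is paid for with exactly one slim pass.

```python
def new_rejection_rate(base_rate: float, drop: float) -> float:
    """Rejection rate after an absolute drop."""
    if not 0.0 <= drop <= base_rate:
        raise ValueError("drop must lie between 0 and the baseline rate")
    return base_rate - drop


def relative_verify_cost(base_rate: float,
                         drop: float,
                         slim_cost_ratio: float) -> float:
    """Verification cost relative to the baseline (1.0 means unchanged).

    Remaining rejects pay one full pass each; diverted tokens pay
    slim_cost_ratio of a full pass each (an assumed, hypothetical ratio).
    """
    remaining = new_rejection_rate(base_rate, drop)
    return (remaining + drop * slim_cost_ratio) / base_rate


if __name__ == "__main__":
    base = 0.30  # illustrative baseline rejection rate
    for drop in (0.10, 0.16, 0.22):  # low end, midpoint, high end of the reported band
        rate = new_rejection_rate(base, drop)
        cost = relative_verify_cost(base, drop, slim_cost_ratio=0.25)
        print(f"drop {drop:.2f}: rejection rate {rate:.2f}, "
              f"relative verification cost {cost:.2f}")
```

Note that this measures verification cost only. The end-to-end speedup is smaller because drafting and the accepted tokens still take time, which is consistent with a reduction of this size in full passes translating into the reported 10–20% overall gain.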

| Approach | Verification structure | Cost for an uncertain token | Adds a model to memory? |
|---|---|---|---|
| Plain (non-drafting) decode | one forward pass per token, no speculation | n/a (nothing is guessed) | no |
| Speculative decoding (baseline) | binary: accept the draft, or full re-verify | one full-model pass | yes (the small drafter) |
| VIA-SD (this paper) | three tiers: accept · slim verify · full verify | a slim routed-subnetwork pass | no extra (slim verifier shares the verifier's weights) |

Because the change lives in how tokens are verified rather than in the model's weights or the drafter, it composes with the rest of the speculative-decoding toolkit — the same idea could sit behind a better-trained drafter or a tree verifier, and the production gain will still depend on the model family and where the workload actually spends its time.

Frequently Asked Questions