The news. On June 10, 2026, researchers posted VIA-SD (arXiv:2606.12243). It restructures speculative decoding into a multi-tier verifier: high-confidence draft tokens are accepted directly, medium-confidence tokens are re-checked by a slim verifier carved from the full model via intra-model routing, and only the genuinely uncertain tokens trigger a full-model verification. The authors report a 0.10–0.22 absolute drop in the rejection rate, a 10–20% speedup over speculative-decoding baselines, and 2.5–3× over plain, non-drafting decoding.
Picture one grader handed a stack of answers that a fast student raced ahead and filled in. The obvious-right ones, the grader checks off at a glance — no work. The hopeless ones get the full line-by-line regrade. But many of those answers sit in between: they look roughly right and only need a second glance. The waste isn't in the easy answers or the hard ones — it's in sending every middling answer to the full regrade when a quick skim would have settled it. That stack is the drafter's proposed tokens; the grader is the verifier; and "full regrade for everything uncertain" is exactly how plain speculative decoding behaves.
Underneath the metaphor, the bottleneck is the verification step. Standard speculative decoding is binary: each drafted token is either accepted, or it is rejected and the full target model re-runs to produce the correct token. That full pass is the single most expensive operation in the decode loop, and every rejected token pays for one — even the near-misses that the model was almost sure about anyway. The drafter's job is to keep the rejection rate low, but no drafter is perfect, so the loop keeps cashing in expensive full passes on tokens that barely missed.
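The binary rule can be sketched in a few lines. This is a minimal toy, assuming greedy decoding and treating the target model as a plain function from a token prefix to its next token; `verify_draft` and `toy_target` are hypothetical names for illustration, not anything taken from the paper. Real implementations typically score the whole draft window in one batched target pass and may use a probabilistic accept rule instead of an exact match.

```python
from typing import Callable, List, Sequence, Tuple


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> Tuple[List[int], int]:
    """Greedy binary verification (toy sketch).

    Accept draft tokens while they match the target's greedy choice.
    At the first mismatch, emit the target's token instead and discard
    the rest of the draft. Returns (emitted tokens, accepted count).
    """
    context = list(prefix)
    emitted: List[int] = []
    accepted = 0
    for tok in draft:
        # Toy simplification: one target call per position.
        expected = target_next(context)
        if tok == expected:
            emitted.append(tok)
            context.append(tok)
            accepted += 1
        else:
            # Rejection: the full model supplies the correct token.
            emitted.append(expected)
            break
    return emitted, accepted


def toy_target(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the full model: predicts last token + 1."""
    return (context[-1] + 1) % 100


emitted, accepted = verify_draft([1, 2, 3], [4, 5, 9, 7], toy_target)
print(emitted, accepted)  # [4, 5, 6] 2
```

The third draft token (9) misses, so the target's token (6) replaces it and the fourth draft token is thrown away. In this scheme every miss is settled by the full model, however close the draft was.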
VIA-SD breaks the binary into three tiers gated by confidence. A high-confidence draft token is accepted outright. A medium-confidence token goes not to the full model but to a slim verifier. The load-bearing trick is that this slim verifier is a routed sub-network of the full model itself, sharing its weights, so it adds no second verifier model to GPU memory. Only the genuinely low-confidence tokens fall through to a full-model verification. The middle tier is the whole idea: it intercepts the near-misses that used to trigger a full pass and settles them with a much cheaper one.
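As a rough illustration of the gate, the routing decision might look like the sketch below. The thresholds `HI` and `LO`, the `Tier` enum, the `route` and `tally` functions, and the sample confidences are all hypothetical, chosen only to show the shape of the idea; nothing here reproduces the paper's actual confidence measure or cut-offs.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class Tier(Enum):
    ACCEPT = "accept"       # no verification work
    SLIM_VERIFY = "slim"    # routed sub-network pass
    FULL_VERIFY = "full"    # full-model pass


# Assumed thresholds, for illustration only.
HI = 0.9
LO = 0.5


def route(confidence: float, hi: float = HI, lo: float = LO) -> Tier:
    """Map a confidence score in [0, 1] to a verification tier."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= hi:
        return Tier.ACCEPT
    if confidence >= lo:
        return Tier.SLIM_VERIFY
    return Tier.FULL_VERIFY


def tally(confidences: Iterable[float]) -> Counter:
    """Count how many tokens in a draft window land in each tier."""
    return Counter(route(c) for c in confidences)


# Made-up confidences for a 10-token draft window.
window = [0.97, 0.95, 0.72, 0.91, 0.40, 0.66, 0.99, 0.93, 0.58, 0.96]
counts = tally(window)
print({tier.value: counts[tier] for tier in Tier})
# {'accept': 6, 'slim': 3, 'full': 1}
```

With these made-up numbers, four tokens need some check but only one pays for a full-model pass; the other three are settled by the slim tier.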
[Figure: token-by-token acceptance check]
Where it earns its keep
Picture a draft window where, under plain speculative decoding, the verifier rejects 3 of every 10 proposed tokens: a 0.30 rejection rate (illustrative). Each of those 3 rejects costs a full-model verification pass, the priciest step in the loop. VIA-SD's confidence gate diverts the near-misses, the medium-confidence tokens, to the slim sub-network, which runs at a fraction of the full model's cost and shares its weights, so memory is unchanged. The paper reports the rejection rate falling by an absolute 0.10 to 0.22; take the midpoint of that band (0.16) and the 0.30 rate drops to about 0.14. In this illustrative case, then, roughly half the full-model passes disappear, replaced by far cheaper slim passes. That is where the reported 10–20% speedup over speculative-decoding baselines comes from. The headline 2.5–3× over plain, non-drafting decoding is mostly the speedup speculative decoding already delivered; VIA-SD's own contribution is the incremental 10–20% on top, bought by paying full price far less often.
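The arithmetic in this example can be checked with a short script. It is only a back-of-envelope sketch: the 0.30 starting rate is the illustrative figure from the paragraph above, the 0.10 to 0.22 band is the reported drop, and treating the rejection rate as the share of tokens that need a full-model pass follows the simplification used here. The function names are hypothetical.

```python
BASELINE_REJECTION = 0.30      # illustrative starting rate, not a measurement
REPORTED_DROP = (0.10, 0.22)   # absolute drop band reported for VIA-SD


def remaining_rate(baseline: float, drop: float) -> float:
    """Rejection rate left after an absolute drop."""
    if drop > baseline:
        raise ValueError("drop cannot exceed the baseline rate")
    return round(baseline - drop, 2)


def share_removed(baseline: float, drop: float) -> float:
    """Fraction of the baseline's full-model passes that disappear."""
    return round(drop / baseline, 2)


midpoint = round(sum(REPORTED_DROP) / 2, 2)

for label, drop in (
    ("low", REPORTED_DROP[0]),
    ("mid", midpoint),
    ("high", REPORTED_DROP[1]),
):
    rate = remaining_rate(BASELINE_REJECTION, drop)
    share = share_removed(BASELINE_REJECTION, drop)
    print(f"{label}: drop={drop:.2f} -> rate={rate:.2f}, removed={share:.0%}")
# low: drop=0.10 -> rate=0.20, removed=33%
# mid: drop=0.16 -> rate=0.14, removed=53%
# high: drop=0.22 -> rate=0.08, removed=73%
```

At the midpoint, a little over half of the full-model passes go away, which matches the "roughly half" figure. At the ends of the band the share ranges from about a third to nearly three quarters, so the result is sensitive to where a given workload falls.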
| Approach | Verification structure | Cost for an uncertain token | Adds a model to memory? |
|---|---|---|---|
| Plain (non-drafting) decode | one forward pass per token, no speculation | n/a — nothing is guessed | no |
| Speculative decoding (baseline) | binary: accept the draft, or full re-verify | one full-model pass | yes — the small drafter |
| VIA-SD (this paper) | three tiers: accept · slim verify · full verify | a slim routed-subnetwork pass at medium confidence; a full-model pass only at low confidence | no extra model beyond the drafter; the slim verifier shares the full model's weights |
Because the change lives in how tokens are verified rather than in the model's weights or the drafter, it composes with the rest of the speculative-decoding toolkit — the same idea could sit behind a better-trained drafter or a tree verifier, and the production gain will still depend on the model family and where the workload actually spends its time.
Continue in track: Speculative Decoding, the draft-verify loop VIA-SD rebuilds

Related explainers
- PPOW — window-level RL for speculative drafters — the complementary lever: train the drafter to miss less, instead of re-checking misses cheaper
- PSD — parallel speculative decoding for diffusion LLMs — the same draft-verify idea ported to a non-autoregressive model
- MarginGate — margin-gated verification for batch-invariant decoding — another rethink of the verification step, aimed at determinism rather than speed