The news. On June 30, 2026, the BlockPilot paper appeared on arXiv. It speeds up diffusion-based speculative decoding by learning a lightweight policy that reads a sample's prefill representation and predicts the draft block size to use for that one input. On a 4B model at temperature T = 1 it reports an acceptance length of 5.92 and a 4.20× decoding speedup.

Picture the delivery dispatcher at a busy depot. Orders pile up in a queue, and the dispatcher loads a stack of them onto each scooter run. Load too few and the scooter is back in five minutes for another trip — barely faster than one-at-a-time. Load too many and the route falls apart partway: the customer check keeps every order up to the first wrong one, and whatever was loaded behind it comes straight back, unpaid. The whole game is picking how big a stack to load — and the right stack depends on the route.

Diffusion-based speculative decoding faces exactly this choice. A diffusion model can draft a whole block of future tokens in parallel; the block is then verified in one pass, and only the accepted prefix is kept. The block size is how far ahead it drafts before it stops to check. A small block wastes the model's parallelism. A big block drafts tokens so far out that the later ones are low-confidence guesses, so the verifier rejects the tail and the compute spent on it is thrown away.
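The trade-off above can be put in numbers with a toy model. This is a minimal sketch, not the paper's method: it assumes each drafted position is accepted independently, with a probability that starts at `p_first` and shrinks by a factor `decay` per position (both values are made up), and that verification keeps tokens up to the first rejection.

```python
def expected_accepted(block_size: int, p_first: float, decay: float) -> float:
    """Expected length of the accepted prefix for one drafted block.

    Assumption (illustrative only): position i (0-based) passes verification
    with probability p_first * decay**i, independently of the others.
    """
    expected = 0.0
    prefix_prob = 1.0  # probability that every position so far was accepted
    for i in range(block_size):
        prefix_prob *= p_first * decay ** i
        # E[prefix length] = sum over k of P(first k positions all accepted)
        expected += prefix_prob
    return expected


# Small blocks cap what can be kept; large blocks mostly add rejected tail.
for block_size in (2, 4, 8, 16, 32):
    kept = expected_accepted(block_size, p_first=0.9, decay=0.9)
    wasted = block_size - kept
    print(f"block={block_size:2d}  kept={kept:5.2f}  wasted={wasted:5.2f}")
```

Under these assumed probabilities the kept count flattens out as the block grows, while the wasted count keeps rising with every extra drafted token.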

Here is the catch the paper leans on: the best block size is not a global constant — it varies input by input. An easy, predictable continuation can safely be drafted far ahead; a hard one can't. Prior diffusion speculative-decoding setups typically ship one fixed block size (or one grid-searched value), so that single value is a compromise — too timid on the easy inputs, too greedy on the hard ones.
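To see why one fixed value is a compromise, here is a small sketch that compares a fixed block size against the best size for each of two inputs. Everything numeric is an assumption for illustration: the "easy" and "hard" acceptance profiles, the candidate sizes, and a cycle cost that grows mildly with block size so that rejected tokens are not free.

```python
CANDIDATES = (2, 4, 8, 16, 32)  # hypothetical menu of block sizes


def accepted_per_cycle(block_size: int, p_first: float, decay: float) -> float:
    """Expected accepted prefix if position i passes with p_first * decay**i."""
    total, prefix_prob = 0.0, 1.0
    for i in range(block_size):
        prefix_prob *= p_first * decay ** i
        total += prefix_prob
    return total


def throughput(block_size: int, p_first: float, decay: float) -> float:
    """Accepted tokens per unit cost; the cost model is an assumption."""
    cycle_cost = 1.0 + 0.05 * block_size  # fixed overhead + per-token cost
    return accepted_per_cycle(block_size, p_first, decay) / cycle_cost


def best_block(p_first: float, decay: float) -> int:
    """Candidate block size with the highest throughput for this input."""
    return max(CANDIDATES, key=lambda b: throughput(b, p_first, decay))


easy = {"p_first": 0.98, "decay": 0.99}  # predictable continuation (made up)
hard = {"p_first": 0.80, "decay": 0.85}  # unpredictable continuation (made up)

for name, profile in (("easy", easy), ("hard", hard)):
    fixed = throughput(8, **profile)
    best = best_block(**profile)
    print(f"{name}: fixed block 8 -> {fixed:.2f}, "
          f"best block {best} -> {throughput(best, **profile):.2f}")
```

With these made-up profiles the easy input prefers a larger block than the hard one, and a single middle value falls short of the per-input best on both.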

BlockPilot's move is to stop guessing and read the route. After the prompt is prefilled, a small learned policy reads the prefill representation and predicts the block size for that specific request: one lightweight decision, made once, right after prefilling. The dispatcher glances at the order slip and sizes the run to match.
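As a rough picture of what such a policy could look like, the sketch below mean-pools the prefill hidden states and scores a fixed menu of block sizes with a linear layer. This is a hypothetical stand-in: the pooling, the linear scorer, the candidate menu, and the names `mean_pool` and `predict_block_size` are all assumptions, not the architecture or training recipe from the paper.

```python
from typing import List, Sequence

CANDIDATE_BLOCK_SIZES = [4, 8, 16, 32]  # hypothetical menu of block sizes


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token prefill vectors into one vector for the prompt."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]


def predict_block_size(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],  # one row per candidate (assumed learned)
    biases: Sequence[float],
) -> int:
    """Pick the candidate block size with the highest linear score."""
    pooled = mean_pool(hidden_states)
    scores = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return CANDIDATE_BLOCK_SIZES[best]


# Toy 2-dimensional "prefill" vectors and hand-set weights, for illustration.
toy_weights = [[0.0, 1.0], [0.0, 0.5], [0.5, 0.0], [1.0, 0.0]]
toy_biases = [0.0, 0.0, 0.0, 0.0]
print(predict_block_size([[1.0, 0.0], [1.0, 0.0]], toy_weights, toy_biases))
print(predict_block_size([[0.0, 1.0]], toy_weights, toy_biases))
```

The point of the sketch is the shape of the decision: one cheap forward pass over features that prefill already produced, run once per request before any drafting starts.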

Where the speedup comes from. Here is the arithmetic. The 4.20× and 5.92 are the paper's; the baseline acceptance, reply length, and cycle counts below are illustrative. Assume each draft-plus-verify cycle costs about the same regardless of block size, since the draft is parallel and the verify is one pass. Say a conservative fixed block commits only about 1.4 accepted tokens per cycle; then an illustrative 256-token reply takes roughly 256 / 1.4 ≈ 183 cycles. BlockPilot's reported acceptance length of 5.92 means each cycle commits about 5.92 tokens, so the same reply takes roughly 256 / 5.92 ≈ 43 cycles. With the same cost per cycle, the ratio 5.92 / 1.4 ≈ **4.2×** is consistent with the reported 4.20× speedup. The accepted tokens are still the ones the model would have produced, since verification keeps only the tokens that pass; the win is fewer cycles wasted on a mis-sized block.
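The same arithmetic as a few lines of Python, for anyone who wants to plug in other numbers. Only the 5.92 acceptance length comes from the paper; the 1.4 baseline and the 256-token reply are the illustrative values used above, and equal cost per cycle is an assumption.

```python
REPLY_TOKENS = 256            # illustrative reply length
BASELINE_ACCEPTANCE = 1.4     # illustrative tokens per cycle for a fixed block
BLOCKPILOT_ACCEPTANCE = 5.92  # acceptance length reported in the paper


def cycles_needed(reply_tokens: int, tokens_per_cycle: float) -> float:
    """Draft-plus-verify cycles needed to emit the whole reply."""
    return reply_tokens / tokens_per_cycle


baseline_cycles = cycles_needed(REPLY_TOKENS, BASELINE_ACCEPTANCE)
blockpilot_cycles = cycles_needed(REPLY_TOKENS, BLOCKPILOT_ACCEPTANCE)

# With equal cost per cycle (an assumption), speedup is the ratio of cycles.
speedup = baseline_cycles / blockpilot_cycles

print(f"baseline cycles:   {baseline_cycles:.0f}")
print(f"BlockPilot cycles: {blockpilot_cycles:.0f}")
print(f"speedup:           {speedup:.2f}x")
```

The reply length cancels out of the ratio, so the speedup under this model depends only on the two acceptance lengths.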

How does per-input sizing sit next to the fixed strategies it replaces?

| Block-size strategy | How the size is chosen | Adapts per input? | Tuning cost |
| --- | --- | --- | --- |
| Fixed small block | one conservative constant | no | none; safe but slow |
| Fixed large block | one aggressive constant | no | none; fast when it works, wasteful when acceptance is low |
| Grid-searched block | one value tuned offline | no | an offline sweep per model |
| BlockPilot (adaptive) | predicted from the prefill, per request | yes | train a small policy once |

The honest caveats: the 4.20× and the 5.92 acceptance length are reported on a 4B model at T = 1, so a different model, temperature, or task will land elsewhere on the curve, and the policy has to be trained rather than dropped in for free. But the idea is general: a diffusion speculative-decoding stack that ships one fixed block size leaves throughput on the table, and a per-input policy can recover some of it.
