The news. On June 30, 2026, the BlockPilot paper appeared on arXiv. It speeds up diffusion-based speculative decoding by learning a lightweight policy that reads a sample's prefill representation and predicts the draft block size to use for that one input. On a 4B model at temperature T = 1 it reports an acceptance length of 5.92 and a 4.20× decoding speedup. Read the paper →
Picture the delivery dispatcher at a busy depot. Orders pile up in a queue, and the dispatcher loads a stack of them onto each scooter run. Load too few and the scooter is back in five minutes for another trip — barely faster than one-at-a-time. Load too many and the route falls apart partway: the customer check keeps every order up to the first wrong one, and whatever was loaded behind it comes straight back, unpaid. The whole game is picking how big a stack to load — and the right stack depends on the route.
That is exactly the choice a diffusion LLM faces. A diffusion model can draft a whole block of future tokens in parallel and then verify them in one pass, keeping the accepted prefix. The block size is how far ahead it drafts before it stops to check. A small block wastes the model's parallelism; a big block drafts tokens so far out that the later ones are low-confidence guesses, and the verifier rejects the tail, so the compute spent on it is thrown away.
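The keep-the-prefix rule can be sketched in a few lines. This is a simplified illustration, not the paper's verifier: it assumes an exact-match check stands in for whatever acceptance rule the verifier applies, and the token ids and the function name `accepted_prefix_length` are made up for the example.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count the leading draft tokens that agree with the verifier.

    Assumption: acceptance is modeled as exact token match. Real verifiers
    may use a probabilistic acceptance rule instead.
    """
    kept = 0
    for drafted_token, verified_token in zip(draft, verified):
        if drafted_token != verified_token:
            break  # first wrong token: everything after it is rejected
        kept += 1
    return kept


# Hypothetical token ids: the draft goes wrong at position 3.
draft = [11, 42, 7, 99, 5, 3]
verified = [11, 42, 7, 13, 5, 3]

kept = accepted_prefix_length(draft, verified)
wasted = len(draft) - kept  # drafted tokens whose compute is thrown away
print(f"kept {kept} of {len(draft)} drafted tokens, wasted {wasted}")
```

Note that positions 4 and 5 match again after the miss, yet they are still discarded: only the unbroken prefix counts, which is why an oversized block is costly.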
Here is the catch the paper leans on: the best block size is not a global constant — it varies input by input. An easy, predictable continuation can safely be drafted far ahead; a hard one can't. Prior diffusion speculative-decoding setups typically ship one fixed block size (or one grid-searched value), so that single value is a compromise — too timid on the easy inputs, too greedy on the hard ones.
BlockPilot's move is to stop guessing and read the route. After the prompt is prefilled, a small learned policy looks at that prefill representation and predicts the block size to use for this specific request — one lightweight decision, made once, right after prefilling. The dispatcher glances at the order slip and sizes the run to match — the block size is the one thing the policy sets per input.
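To make the shape of that decision concrete, here is a minimal sketch of a per-request block-size policy. Everything in it is an assumption for illustration: the mean pooling, the single linear scoring layer, the candidate menu `CANDIDATE_BLOCK_SIZES`, and the toy weights are hypothetical and are not taken from the paper, which this summary only describes as a lightweight learned policy over the prefill representation.

```python
from typing import List, Sequence

# Hypothetical menu of block sizes the policy chooses among.
CANDIDATE_BLOCK_SIZES = [4, 8, 16, 32]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token prefill vectors into one vector (assumed pooling)."""
    count = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(row[j] for row in hidden_states) / count for j in range(dim)]


def predict_block_size(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> int:
    """Score each candidate with a linear layer and return the best one.

    Called once per request, right after prefill.
    """
    pooled = mean_pool(hidden_states)
    scores = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return CANDIDATE_BLOCK_SIZES[best]


# Toy prefill representation: two prompt tokens, two hidden dimensions.
prefill = [[1.0, 0.0], [3.0, 2.0]]
# Toy "trained" parameters, one weight row per candidate block size.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0, 0.0]

block_size = predict_block_size(prefill, weights, biases)
print("block size for this request:", block_size)
```

The point of the sketch is the cost profile: one pooled vector and one small matrix product per request, which is negligible next to the draft-and-verify cycles it steers.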
Where the speedup comes from. Here is the arithmetic; the 4.20× and 5.92 are the paper's, while the reply length, the cycle counts, and the 1.4 baseline below are illustrative. Assume each draft-plus-verify cycle costs about the same regardless of block size, since the draft is parallel and the verify is one pass. Say a conservative fixed block commits only about 1.4 accepted tokens per cycle; then an illustrative 256-token reply takes roughly 256 / 1.4 ≈ 183 cycles. BlockPilot's reported acceptance length of 5.92 means each cycle commits about 5.92 tokens, so the same reply takes roughly 256 / 5.92 ≈ 43 cycles. With the same cost per cycle, the ratio 5.92 / 1.4 ≈ **4.2×** lines up with the reported 4.20× speedup. The accepted tokens are still the ones the model would have produced, since verification keeps only the tokens that pass. The win is fewer cycles wasted on a mis-sized block.
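The same back-of-the-envelope calculation, written out so the numbers can be changed. Only the 5.92 acceptance length comes from the paper; the 256-token reply, the 1.4 baseline, and the equal-cost-per-cycle assumption are the illustrative choices from the paragraph above.

```python
def cycles_needed(reply_tokens: int, accepted_per_cycle: float) -> float:
    """Draft-plus-verify cycles to emit a reply, ignoring the ragged last cycle."""
    return reply_tokens / accepted_per_cycle


REPLY_TOKENS = 256          # illustrative reply length
BASELINE_ACCEPTED = 1.4     # illustrative fixed-block acceptance per cycle
BLOCKPILOT_ACCEPTED = 5.92  # acceptance length reported in the paper

baseline_cycles = cycles_needed(REPLY_TOKENS, BASELINE_ACCEPTED)
blockpilot_cycles = cycles_needed(REPLY_TOKENS, BLOCKPILOT_ACCEPTED)

# With equal cost per cycle, speedup is just the ratio of cycle counts.
speedup = baseline_cycles / blockpilot_cycles

print(f"baseline:   {baseline_cycles:.0f} cycles")
print(f"BlockPilot: {blockpilot_cycles:.0f} cycles")
print(f"speedup:    {speedup:.2f}x")
```

Because the per-cycle cost cancels out, the speedup depends only on the two acceptance lengths; the reply length affects the cycle counts but not the ratio.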
How does per-input sizing sit next to the fixed strategies it replaces?
| Block-size strategy | How the size is chosen | Adapts per input? | Tuning cost |
|---|---|---|---|
| Fixed small block | one conservative constant | no | none — safe but slow |
| Fixed large block | one aggressive constant | no | none — fast when it works, wasteful when acceptance is low |
| Grid-searched block | one value tuned offline | no | an offline sweep per model |
| BlockPilot (adaptive) | predicted from the prefill, per request | yes | train a small policy once |
The honest caveats: the 4.20× and the 5.92 acceptance length are reported on a 4B model at T = 1. A different model, temperature, or task will land elsewhere on the curve, and the policy has to be trained rather than dropped in for free. But the idea is general: a diffusion speculative-decoding stack that ships a fixed block size leaves throughput on the table, and a per-input policy can pick it back up.
Related explainers
- DiffusionGemma — Parallel block decoding — the base parallel-block decode whose block size BlockPilot learns to set
- PSD — Parallel speculative decoding for diffusion LLMs — an orthogonal axis: PSD unmasks more positions per pass; BlockPilot sizes how far each pass drafts
- Spec-decode latency — Load-dependent latency model — the caveat that even a perfectly sized block loses its margin once the server saturates