Why does the draft block size matter so much?

In diffusion speculative decoding the block size sets how many tokens the model drafts in parallel before it stops to verify. Too small and it barely beats one-token-at-a-time decoding; too large and the far-out tokens are low-confidence, so the verifier rejects the tail and the compute spent drafting it is wasted. The best size is a balance that depends on how predictable each input is, which is why a single fixed value is a poor compromise.

How is BlockPilot different from PSD or standard speculative decoding?

Standard speculative decoding and PSD focus on how tokens are drafted and verified — PSD, for instance, commits multiple positions per forward pass in a diffusion LLM. BlockPilot is orthogonal: it does not change the draft-verify mechanism, it changes how far ahead you draft, learning that block size per input rather than fixing it. A strong parallel drafter and BlockPilot's adaptive sizing can be combined.

BlockPilot gives diffusion speculative decoding a 4.2× speedup: instance-adaptive draft block sizing

Q: What is instance-adaptive draft block sizing?

It is BlockPilot's technique for diffusion-based speculative decoding: instead of using one fixed draft block size for every request, a small learned policy reads the prompt's prefill representation and predicts the block size to use for that specific input. Choosing the size per input, rather than globally, lifts the average acceptance length to a reported 5.92 and gives a 4.20× decoding speedup on a 4B model at temperature T=1.

Jargon

Diffusion LLM: A language model that generates a whole span of tokens in parallel and refines them, instead of one token at a time. That parallel span is what makes a big draft block possible.
Speculative decoding: Draft several future tokens cheaply, then have the model verify them in one pass and keep the accepted ones.
Draft block / block size: The number of tokens drafted before the model stops to verify. Small blocks barely save time; large blocks risk a low-confidence tail that gets rejected.
Acceptance length: The average number of drafted tokens the verifier keeps per step. Higher acceptance means more real progress per forward pass — BlockPilot reports 5.92.
Prefill representation: The model's internal state right after it reads the prompt, before it writes any output. BlockPilot reads this once to size the block.
Instance-adaptive policy: A tiny learned model that outputs a decision per input rather than using one global setting. Here it predicts the block size for this request.
Temperature (T): A knob on output randomness. The headline numbers are reported at T = 1, meaning tokens are sampled from the model's unscaled output distribution.

The news. On June 30, 2026, the BlockPilot paper appeared on arXiv. It speeds up diffusion-based speculative decoding by learning a lightweight policy that reads a sample's prefill representation and predicts the draft block size to use for that one input. On a 4B model at temperature T = 1 it reports an acceptance length of 5.92 and a 4.20× decoding speedup.

Picture the delivery dispatcher at a busy depot. Orders pile up in a queue, and the dispatcher loads a stack of them onto each scooter run. Load too few and the scooter is back in five minutes for another trip — barely faster than one-at-a-time. Load too many and the route falls apart partway: the customer check keeps every order up to the first wrong one, and whatever was loaded behind it comes straight back, unpaid. The whole game is picking how big a stack to load — and the right stack depends on the route.

That is exactly the choice a diffusion LLM faces. A diffusion model can draft a whole block of future tokens in parallel and then verify them in one pass, keeping the accepted prefix. The block size is how far ahead it drafts before it stops to check. A small block wastes the model's parallelism; a big block drafts tokens so far out that the later ones are low-confidence guesses, so the verifier rejects the tail and the compute spent on it is thrown away.
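The keep-the-accepted-prefix rule can be sketched in a few lines of Python. This is a simplified illustration, not the paper's verifier: it assumes an exact-match check against the tokens the verifier would have chosen, whereas verification under sampling (for example at T = 1) typically uses a probabilistic accept/reject test. The function name and the token values are hypothetical.

```python
def accepted_prefix_length(draft_tokens, verifier_tokens):
    """Count the drafted tokens that are kept: everything before the first mismatch."""
    kept = 0
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            break  # the first wrong token ends the accepted prefix
        kept += 1
    return kept


# Hypothetical block of 6 drafted tokens; the verifier disagrees at index 3.
draft = [11, 42, 7, 99, 5, 5]
verified = [11, 42, 7, 13, 5, 5]

kept = accepted_prefix_length(draft, verified)
wasted = len(draft) - kept  # tokens drafted at or behind the mismatch are discarded
print(kept, wasted)
```

In this toy block the two tokens after the mismatch happen to agree with the verifier, yet they are still discarded. Only the unbroken prefix counts, which is why a long low-confidence tail is pure overhead.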

Here is the catch the paper leans on: the best block size is not a global constant — it varies input by input. An easy, predictable continuation can safely be drafted far ahead; a hard one can't. Prior diffusion speculative-decoding setups typically ship one fixed block size (or one grid-searched value), so that single value is a compromise — too timid on the easy inputs, too greedy on the hard ones.
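A toy model makes the input dependence concrete. The sketch below is not from the paper. It assumes each drafted token is accepted independently with probability `p`, so the chance that the accepted prefix reaches position k is p^k, and it assumes a hypothetical cycle cost with a small per-token term so that over-drafting has a price. All names and constants are illustrative.

```python
def expected_accepted(p, block_size):
    """Expected accepted-prefix length if each drafted token is accepted
    independently with probability p (a toy assumption)."""
    return sum(p ** k for k in range(1, block_size + 1))


def tokens_per_cost(p, block_size, fixed_cost=1.0, per_token_cost=0.05):
    """Accepted tokens per unit of compute under a hypothetical cost model."""
    cycle_cost = fixed_cost + per_token_cost * block_size
    return expected_accepted(p, block_size) / cycle_cost


def best_block_size(p, candidates=range(1, 65)):
    """Block size that maximizes accepted tokens per unit cost for this input."""
    return max(candidates, key=lambda b: tokens_per_cost(p, b))


easy_best = best_block_size(0.9)  # predictable continuation
hard_best = best_block_size(0.5)  # unpredictable continuation
print("easy input:", easy_best, "hard input:", hard_best)
```

Under these assumptions the predictable input prefers a much larger block than the unpredictable one, so any single fixed value is wrong for at least one of them.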

BlockPilot's move is to stop guessing and read the route. After the prompt is prefilled, a small learned policy looks at that prefill representation and predicts the block size to use for this specific request — one lightweight decision, made once, right after prefilling. The dispatcher glances at the order slip and sizes the run to match — the block size is the one thing the policy sets per input.
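To make the shape of that decision concrete, here is a minimal sketch of one way such a policy could look. It is an assumption, not the paper's architecture: it mean-pools the prefill hidden states and scores a hypothetical menu of block sizes with a single linear layer. The candidate sizes, the function names, and the weights (made up, not trained) are all illustrative.

```python
CANDIDATE_BLOCK_SIZES = [4, 8, 16, 32]  # hypothetical menu of block sizes


def mean_pool(hidden_states):
    """Average per-token prefill vectors into one fixed-size summary vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[i] for vec in hidden_states) / n for i in range(dim)]


def predict_block_size(hidden_states, weights, biases):
    """Score each candidate size with a linear head and return the top one."""
    pooled = mean_pool(hidden_states)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(logits)), key=logits.__getitem__)
    return CANDIDATE_BLOCK_SIZES[best]


# Toy prefill: 3 prompt tokens with hidden size 2.
prefill = [[0.2, 1.0], [0.4, 0.8], [0.0, 1.2]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]  # one row per candidate
biases = [0.0, 0.0, 0.0, 0.0]

block_size = predict_block_size(prefill, weights, biases)
print(block_size)
```

The point of the sketch is the cost profile: one pooling step and one small matrix product, run once per request after prefill, which is negligible next to a forward pass of the model itself.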

Where the speedup comes from. Here is the arithmetic. The 4.20× and 5.92 are the paper's; the reply length, the fixed-block acceptance length, and the cycle counts below are illustrative. Assume each draft-plus-verify cycle costs about the same regardless of block size, since the draft is parallel and the verify is one pass. Say a conservative fixed block commits only about 1.4 accepted tokens per cycle; then an illustrative 256-token reply takes roughly 256 / 1.4 ≈ 183 cycles. BlockPilot's reported acceptance length of 5.92 means each cycle commits about 5.92 tokens, so the same reply takes roughly 256 / 5.92 ≈ 43 cycles. With the same cost per cycle, the ratio 5.92 / 1.4 ≈ 4.2× is consistent with the reported 4.20× speedup. The accepted tokens are still the ones the model would have produced, since verification keeps only the tokens that pass; the win is fewer cycles wasted on a mis-sized block.
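The same arithmetic as a short Python check. Only the 5.92 acceptance length comes from the paper; the 256-token reply and the 1.4 tokens per cycle for the fixed block are the illustrative values used above, and the equal-cost-per-cycle assumption is carried over unchanged.

```python
REPLY_TOKENS = 256        # illustrative reply length
FIXED_ACCEPTED = 1.4      # illustrative accepted tokens per cycle, fixed block
ADAPTIVE_ACCEPTED = 5.92  # acceptance length reported for BlockPilot


def cycles_needed(reply_tokens, accepted_per_cycle):
    """Draft-plus-verify cycles needed to emit the whole reply."""
    return reply_tokens / accepted_per_cycle


fixed_cycles = cycles_needed(REPLY_TOKENS, FIXED_ACCEPTED)
adaptive_cycles = cycles_needed(REPLY_TOKENS, ADAPTIVE_ACCEPTED)

# With equal cost per cycle, the speedup is just the ratio of cycle counts.
speedup = fixed_cycles / adaptive_cycles

print(f"fixed: {fixed_cycles:.0f} cycles, adaptive: {adaptive_cycles:.0f} cycles")
print(f"speedup: {speedup:.2f}x")
```

The reply length cancels out of the ratio, so the speedup under this simplification depends only on the two acceptance lengths.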

How does per-input sizing sit next to the fixed strategies it replaces?

Block-size strategy	How the size is chosen	Adapts per input?	Tuning cost
Fixed small block	one conservative constant	no	none — safe but slow
Fixed large block	one aggressive constant	no	none — fast when it works, wasteful when acceptance is low
Grid-searched block	one value tuned offline	no	an offline sweep per model
BlockPilot (adaptive)	predicted from the prefill, per request	yes	train a small policy once

The honest caveats: the 4.20× and the 5.92 acceptance length are reported on a 4B model at T = 1. A different model, temperature, or task will land elsewhere on the curve, and the policy has to be trained rather than dropped in for free. But the idea is general: a diffusion speculative-decoding stack that ships a fixed block size leaves throughput on the table that a per-input policy can recover.


Related explainers

DiffusionGemma — Parallel block decoding — the base parallel-block decode whose block size BlockPilot learns to set
PSD — Parallel speculative decoding for diffusion LLMs — an orthogonal axis: PSD unmasks more positions per pass; BlockPilot sizes how far each pass drafts
Spec-decode latency — Load-dependent latency model — the caveat that even a perfectly sized block loses its margin once the server saturates

