How is PSD different from autoregressive speculative decoding?

Autoregressive speculative decoding speeds up only the temporal axis: a small draft model proposes candidate tokens for the next N positions in sequence, and the target model then verifies them in a single, wider forward pass. Every candidate extends the sequence strictly left to right. PSD also speeds up the spatial axis by committing multiple non-adjacent masked positions per forward pass, which is only possible in a diffusion LLM, where every masked position gets a confidence score from the same shared forward pass. An autoregressive transformer never has a score for 'position N+5' because it has not yet generated position N+1. PSD is also training-free: no draft model to ship, no fine-tuning epoch, no extra calibration set.

Where does the 5.5× speedup come from, and what are the caveats?

5.5× is tokens-per-forward-pass, not wall-clock. The peak comes when the diffusion LLM produces many high-confidence positions per pass — on tractable tasks (short generation, well-formed code skeletons) the parallel-decoding policy commits a large fraction of the grid every pass. The honest caveats: 5.5× is the peak across reported settings, not the average; on hard tasks (long-form math, tricky code) the speedup compresses toward the lower end of the range as accept rates drop. Quality is 'comparable to greedy' — a precise hedge that allows small per-task differences. And PSD only helps diffusion LLMs; it does not transfer to standard autoregressive transformers.

PSD paper — Parallel speculative decoding for diffusion LLMs

What is Parallel Speculative Decoding (PSD)?

PSD is a training-free decoding scheme for diffusion LLMs introduced in the May 2026 PSD paper (arXiv:2605.15609). A vanilla diffusion-LLM decode commits one masked position per forward pass, so a 256-token reply costs roughly 256 passes through the model. PSD reuses the per-position confidence scores that every forward pass already produces and commits multiple high-confidence positions per pass, under a configurable parallel-decoding policy (the paper evaluates several). The same pass also produces a multi-depth draft of further-out positions, which is held for a final batched verification pass that accepts the deepest consistent prefix: the diffusion analogue of autoregressive draft-verify. Reported result: up to 5.5× tokens per forward pass with accuracy comparable to greedy decoding, with no fine-tuning and no auxiliary draft model.

Jargon

Diffusion LLM: A language model that generates text by repeatedly unmasking positions in a fixed-length grid, rather than producing tokens one at a time. Standard decode unmasks one position per forward pass.
Forward pass: One full evaluation of the model's weights on the current input. Each pass is the dominant cost in decode — wall-clock time per reply is roughly (passes × pass latency).
Masked position: A slot in the diffusion grid that has not yet been committed to a token. The decoder's job is to fill all masked positions.
Confidence score: The highest softmax probability the model assigns to any token at a given slot, which serves as a built-in signal of how sure the model is about that guess. PSD reuses it as an unmasking trigger.
Adaptive unmasking: The policy that picks which positions to commit per pass, based on the confidence scores. PSD is policy-agnostic — any parallel-decoding policy that picks multiple positions per pass plugs in; the paper reports results across a few such policies.
Multi-depth draft: A speculative draft of positions further out in the sequence, produced by the same forward pass. Verified in a final batched pass that keeps the deepest consistent prefix — the diffusion analogue of autoregressive draft-verify.
Training-free: No fine-tuning, no auxiliary draft model, no calibration data. PSD plugs in on top of an existing diffusion LLM and only changes the decode loop.

The news. On May 15, 2026, the PSD paper appeared on arXiv. It claims up to 5.5× tokens per forward pass on diffusion LLMs with accuracy comparable to greedy decoding and, crucially, it is training-free: no draft model, no fine-tuning, no calibration data. The technique was tested across three diffusion LLMs on reasoning and code-generation tasks.

Picture a crossword puzzle solver staring at a half-finished grid. The lazy strategy is to scan left-to-right, top-to-bottom, and fill one square at a time. That is what a vanilla diffusion LLM does: its decode loop commits a single masked position per forward pass, so a 256-token reply costs roughly 256 passes. Each pass is one full forward evaluation of the diffusion LLM (the paper evaluates 7B–8B-scale models, which run comfortably on a single GPU at decode).

The fast strategy is what experienced solvers actually do: glance over the whole grid, mentally rank the clues by "how sure am I," and fill in every confident square in one sweep. The hard ones can wait — by the time you come back, neighboring letters have locked in and the previously-hard clues are easy. PSD applies the same intuition to a transformer. A single forward pass already produces a probability distribution over every masked position. PSD reuses those scores as confidence signals and commits multiple confident positions per pass under a configurable parallel-decoding policy (the paper plugs in several).
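The confident-squares-first idea can be sketched as a toy decode loop. This is a minimal illustration, not the paper's policy: the fixed confidence threshold, the `toy_forward` stand-in for a real diffusion LLM, and all function names are assumptions made for the example.

```python
import math

MASK = None  # hypothetical marker for a slot that is not yet committed


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_forward(grid, vocab=4):
    """Stand-in model (assumption): every third slot is an 'easy clue';
    other slots only become confident once a neighbour is committed."""
    out = []
    for i in range(len(grid)):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(grid)]
        committed = sum(grid[j] is not MASK for j in neighbours)
        peak = 6.0 if i % 3 == 0 else 1.0 + 5.0 * committed
        logits = [0.0] * vocab
        logits[i % vocab] = peak  # the "right answer" for slot i
        out.append(softmax(logits))
    return out


def select_confident(probs_per_position, masked, threshold=0.9):
    """Pick every masked slot whose top probability clears the threshold.
    If none does, commit the single most confident slot so decode progresses."""
    best = {}
    for pos in masked:
        probs = probs_per_position[pos]
        token = max(range(len(probs)), key=probs.__getitem__)
        best[pos] = (probs[token], token)
    chosen = {p: t for p, (c, t) in best.items() if c >= threshold}
    if not chosen:
        p = max(best, key=lambda q: best[q][0])
        chosen = {p: best[p][1]}
    return chosen


def decode(forward, length, threshold=0.9):
    """Fill a fully masked grid; return the tokens and the number of passes."""
    grid = [MASK] * length
    passes = 0
    while any(t is MASK for t in grid):
        probs = forward(grid)  # one forward pass scores every slot
        passes += 1
        masked = [i for i, t in enumerate(grid) if t is MASK]
        for pos, tok in select_confident(probs, masked, threshold).items():
            grid[pos] = tok
    return grid, passes
```

With this toy model, `decode(toy_forward, 12, threshold=0.9)` fills the grid in 3 passes, while an unreachable threshold such as 1.1 falls back to one position per pass and takes 12.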

The trick splits into two axes. On the spatial axis, multiple positions get unmasked per forward pass. On the temporal axis, the same forward pass also produces a multi-depth draft of positions further out, kept around for a final batched verification pass that accepts the deepest consistent prefix — the diffusion analogue of standard draft-verify speculative decoding. The two axes are complementary; the paper reports that combining them is what unlocks the headline speedup, with the spatial axis often the larger lever on tractable tasks.
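The "deepest consistent prefix" step can be pictured with a small sketch. The article does not spell out the paper's consistency test, so the exact-match rule below is an assumption borrowed from greedy autoregressive draft-verify, and the function name is hypothetical.

```python
def accept_consistent_prefix(draft, verified):
    """Return the longest prefix of `draft` that agrees with `verified`.

    draft:    tokens speculated for further-out positions, nearest first
    verified: tokens the verification pass produces for the same positions
    Assumption: 'consistent' means exact token agreement.
    """
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break  # the first disagreement ends the accepted prefix
        accepted.append(d)
    return accepted
```

For example, a draft of `[5, 7, 2, 9]` checked against `[5, 7, 3, 9]` keeps `[5, 7]`: the later match at the fourth position does not count, because everything after the first mismatch was drafted on a context the verifier rejected.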

PSD is also training-free, which is the line the authors emphasize. There is no auxiliary draft model to ship, no fine-tuning epoch, no extra calibration set. It plugs in on top of an existing diffusion LLM and only changes the decode loop — the same kind of swap as turning on a speculative-decoding variant in an inference server.

Where the wall-clock time actually goes. Imagine a 256-token reply from a 30B-class diffusion LLM running at roughly 0.12 s per forward pass on one GPU at decode (illustrative: pass cost varies with model size, KV layout, and batch). The baseline unmasks 1 token per pass, so the reply takes 256 × 0.12 s ≈ 30.7 s. PSD at 5.5 tokens per pass needs 256 / 5.5 ≈ 47 passes, or about 47 × 0.12 s ≈ 5.6 s (also illustrative, using the peak figure). The model has not changed and accuracy stays comparable to greedy decoding; what PSD buys back is the passes the baseline spends committing a single token each.
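The same back-of-the-envelope arithmetic as a small helper, for readers who want to plug in their own numbers. The function name is hypothetical, and the model assumes every pass costs the same and that tokens per pass is a steady average, which real workloads will not match exactly.

```python
import math


def reply_latency(tokens, tokens_per_pass, pass_latency_s):
    """Estimate (passes, seconds) for one reply.

    Assumes a constant per-pass latency and a constant average number of
    tokens committed per pass; the pass count is rounded up to a whole pass.
    """
    passes = math.ceil(tokens / tokens_per_pass)
    return passes, passes * pass_latency_s


baseline = reply_latency(256, 1, 0.12)    # 256 passes, about 30.7 s
psd_peak = reply_latency(256, 5.5, 0.12)  # 47 passes, about 5.6 s
```

Because 5.5× is the reported peak, trying a lower `tokens_per_pass` gives a feel for how quickly the gain shrinks on harder tasks.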

How does PSD compare with the spec-decode techniques readers already know? The table below lays out where each method attacks the latency formula.

Technique	Setting	Speedup axis	Extra model?	Training cost
Greedy diffusion decode	diffusion LLM	none (baseline)	no	none
Autoregressive spec decode	autoregressive LLM	temporal only	yes — small draft model	train or pick a draft model
PSD (this paper)	diffusion LLM	spatial + temporal	no	none — plug in

The honest caveats: 5.5× is the peak speedup, not the average — accept rates depend on the underlying diffusion LLM and the task, and on hard tasks (long-form math, tricky code) PSD's speedup compresses toward the lower end of its range. Accuracy is "comparable to greedy", which is a precise hedge — comparable, not identical. And PSD only helps diffusion LLMs; it does not transfer to standard autoregressive transformers, which never have a confidence score for a "position N+5" because they have not generated position N+1 yet.


