PSD paper — Parallel speculative decoding for diffusion LLMs


The news. On May 15, 2026, the PSD paper appeared on arXiv. It claims up to 5.5× tokens per forward pass on diffusion LLMs with accuracy comparable to greedy decoding and, crucially, does so training-free: no draft model, no fine-tuning, no calibration data. The technique was tested across three diffusion LLMs on reasoning and code-generation tasks.

Picture a crossword solver staring at a half-finished grid. The lazy strategy is to scan left-to-right, top-to-bottom, and fill one square at a time. That is what a vanilla diffusion LLM does: its decode loop commits a single masked position per forward pass, so a 256-token reply costs roughly 256 passes. Each pass is a full forward pass through the diffusion LLM (the paper evaluates 7B–8B-scale models, which run comfortably on a single GPU at decode).

The fast strategy is what experienced solvers actually do: glance over the whole grid, mentally rank the clues by "how sure am I," and fill in every confident square in one sweep. The hard ones can wait — by the time you come back, neighboring letters have locked in and the previously-hard clues are easy. PSD applies the same intuition to a transformer. A single forward pass already produces a probability distribution over every masked position. PSD reuses those scores as confidence signals and commits multiple confident positions per pass under a configurable parallel-decoding policy (the paper plugs in several).
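The "fill in every confident square" step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the fixed `threshold` policy, the `MASK` sentinel, and the function name `unmask_confident` are all assumptions made for the example, and the per-position logits stand in for the output of one forward pass.

```python
import math
from typing import List, Optional, Sequence, Tuple

MASK = None  # hypothetical sentinel for a position that is still masked


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unmask_confident(
    tokens: Sequence[Optional[int]],
    logits_per_position: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed policy knob, not a value from the paper
) -> Tuple[List[Optional[int]], int]:
    """Commit every masked position whose top-1 probability clears the threshold.

    If nothing clears it, commit only the single most confident position so the
    decode loop always makes progress (this matches the one-token baseline).
    Returns the updated token list and the number of positions committed.
    """
    candidates = []  # (confidence, position, token id)
    for pos, tok in enumerate(tokens):
        if tok is not MASK:
            continue
        probs = softmax(logits_per_position[pos])
        best = max(range(len(probs)), key=probs.__getitem__)
        candidates.append((probs[best], pos, best))

    out = list(tokens)
    if not candidates:
        return out, 0

    chosen = [c for c in candidates if c[0] >= threshold]
    if not chosen:
        chosen = [max(candidates)]  # fall back to the most confident position
    for _, pos, best in chosen:
        out[pos] = best
    return out, len(chosen)
```

With one committed token and three masked positions, two of which have sharply peaked logits, a single call commits both confident positions and leaves the uncertain one masked for a later pass.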

The trick splits into two axes. On the spatial axis, multiple positions get unmasked per forward pass. On the temporal axis, the same forward pass also produces a multi-depth draft of positions further out, kept around for a final batched verification pass that accepts the deepest consistent prefix — the diffusion analogue of standard draft-verify speculative decoding. The two axes are complementary; the paper reports that combining them is what unlocks the headline speedup, with the spatial axis often the larger lever on tractable tasks.
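The "deepest consistent prefix" rule on the temporal axis can be illustrated with a short sketch. It assumes a simplified model of the draft: each draft depth is a mapping from position to token id, and the verification pass yields the same structure for comparison. The `Commit` type and the function name are hypothetical, and the paper's actual acceptance rule may differ in detail.

```python
from typing import Dict, List, Tuple

Commit = Dict[int, int]  # position -> token id committed at one draft depth


def accept_deepest_consistent_prefix(
    draft_depths: List[Commit],
    verified_depths: List[Commit],
) -> Tuple[int, Commit]:
    """Accept draft depths in order until the first one the verifier disagrees with.

    Returns how many depths were accepted and the merged commits from those
    depths. Everything from the first mismatch onward is discarded, as in
    standard draft-verify speculative decoding.
    """
    accepted: Commit = {}
    depth = 0
    for draft, verified in zip(draft_depths, verified_depths):
        if draft != verified:
            break  # first inconsistent depth: stop accepting here
        accepted.update(draft)
        depth += 1
    return depth, accepted
```

If the verifier agrees with the first two depths and disagrees at the third, the function accepts two depths and the decode loop resumes from the first rejected position.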

PSD is also training-free, which is the line the authors emphasize. There is no auxiliary draft model to ship, no fine-tuning epoch, no extra calibration set. It plugs in on top of an existing diffusion LLM and only changes the decode loop — the same kind of swap as turning on a speculative-decoding variant in an inference server.

Where the wall-clock time actually goes. Imagine a 256-token reply from a 30B-class diffusion LLM (larger than the 7B–8B models the paper evaluates) running at roughly 0.12s per forward pass on one GPU at decode. The figure is illustrative: pass cost varies with model size, KV layout, and batch. The baseline strategy unmasks 1 token / pass, so the reply takes 256 × 0.12s ≈ 30.7s. PSD at its peak of 5.5× tokens / pass cuts that to about 47 passes × 0.12s ≈ 5.6s (again illustrative). The model has not changed, and the paper reports accuracy comparable to greedy decoding. What PSD buys back is the forward passes the baseline spent committing a single token each.
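The arithmetic above is easy to reproduce. The helper below is a back-of-the-envelope sketch: `decode_latency` is a hypothetical name, and it assumes every pass costs the same and ignores any extra cost of the final verification pass.

```python
import math
from typing import Tuple


def decode_latency(
    reply_tokens: int,
    tokens_per_pass: float,
    seconds_per_pass: float,
) -> Tuple[int, float]:
    """Return (forward passes, wall-clock seconds) for one reply.

    Assumes a constant per-pass cost; real pass cost varies with model size,
    KV layout, and batch.
    """
    passes = math.ceil(reply_tokens / tokens_per_pass)
    return passes, passes * seconds_per_pass


baseline_passes, baseline_seconds = decode_latency(256, 1.0, 0.12)
psd_passes, psd_seconds = decode_latency(256, 5.5, 0.12)
print(f"baseline: {baseline_passes} passes, {baseline_seconds:.1f}s")
print(f"PSD peak: {psd_passes} passes, {psd_seconds:.1f}s")
```

Swapping in a lower tokens-per-pass value, such as 2.0, shows how quickly the saving shrinks when accept rates fall on harder tasks.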

How does PSD compare with the spec-decode techniques readers already know? The table below lays out where each method attacks decode latency, that is, the number of forward passes times the cost of each pass.

Technique | Setting | Speedup axis | Extra model? | Training cost
Greedy diffusion decode | diffusion LLM | none (baseline) | no | none
Autoregressive spec decode | autoregressive LLM | temporal only | yes (small draft model) | train or pick a draft model
PSD (this paper) | diffusion LLM | spatial + temporal | no | none (plug in)

The honest caveats: 5.5× is the peak speedup, not the average — accept rates depend on the underlying diffusion LLM and the task, and on hard tasks (long-form math, tricky code) PSD's speedup compresses toward the lower end of its range. Accuracy is "comparable to greedy", which is a precise hedge — comparable, not identical. And PSD only helps diffusion LLMs; it does not transfer to standard autoregressive transformers, which never have a confidence score for a "position N+5" because they have not generated position N+1 yet.


Frequently Asked Questions