The news. On June 10, 2026, Google released DiffusionGemma, an Apache-2.0 model that generates text by iterative denoising rather than left-to-right sampling. It seeds a block with placeholder tokens and refines 256 tokens in parallel per forward pass using bidirectional attention, reaching 1000+ tokens/sec on an H100 and 700+ tokens/sec on an RTX 5090, and fitting in 18 GB of VRAM when quantized. It is a 26B-parameter mixture-of-experts model with about 3.8B active.

Picture two machines printing the same paragraph. The first is a dot-matrix printer: it types one character at a time, left to right, and the next character can't start until the previous one lands. That is autoregressive decoding, the way nearly every LLM you have used writes: one token at a time. The second is a Polaroid: the whole photo comes out at once, blank and blurry, then sharpens everywhere simultaneously over a few seconds. DiffusionGemma is the Polaroid. It lays down a whole block of placeholder tokens up front and then develops them in parallel, so the paragraph appears all at once and gets clearer with each pass.

Underneath the metaphor, "developing the photo" is iterative denoising. The model seeds a block with 256 noisy placeholder slots, then makes several refinement passes; each pass locks in the tokens it is now confident about and re-evaluates the rest. The trick that makes this legal is bidirectional attention — dropping the causal mask that forces a normal decoder to only look backward. Because every slot can attend to every other slot, future included, the model can self-correct an early token using words that only got resolved later. A left-to-right decoder can never do that: once it commits token 5, tokens 6 onward can lean on it, but it can't lean on them.
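The refinement loop described above can be sketched as a toy. This is not Google's algorithm: `toy_predict` is a hypothetical stand-in for a model forward pass that simply looks up the right answer and attaches a random confidence, and the schedule (commit an equal share of the remaining slots each pass) is an assumption chosen so the block is full by the last pass. The sketch shows only the "commit the confident slots, re-evaluate the rest" shape of the loop.

```python
import random

MASK = "<mask>"  # placeholder token used to seed the block


def toy_predict(block, target, rng):
    """Hypothetical stand-in for one forward pass.

    Returns {slot: (proposed_token, confidence)} for every masked slot.
    A real model would derive these from attention over the whole block;
    here the token is looked up and the confidence is random.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def denoise(target, passes, seed=0):
    """Fill a fully masked block over at most `passes` refinement passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = []  # snapshot of the block after each pass
    for step in range(passes):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposals = toy_predict(block, target, rng)
        # Assumed schedule: commit an equal share of what is left,
        # rounded up, so nothing is masked after the final pass.
        k = -(-len(masked) // (passes - step))  # ceiling division
        most_confident = sorted(
            masked, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


if __name__ == "__main__":
    words = "the whole photo sharpens everywhere at once".split()
    final, snapshots = denoise(words, passes=3)
    for n, snap in enumerate(snapshots, start=1):
        print(f"pass {n}: {' '.join(snap)}")
```

Slots fill in out of order, wherever the stand-in happens to be most confident, rather than left to right. The toy never revisits a committed slot; modelling the self-correction described above would need a rule for re-masking low-confidence tokens, which is left out here.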

| Property | Autoregressive (standard Gemma) | Parallel block decoding (DiffusionGemma) |
| --- | --- | --- |
| How a token is produced | predict the single next token, append, repeat | seed a block of placeholders, denoise all at once |
| Tokens per forward pass | 1 | 256 (Google) |
| Attention | causal (look backward only) | bidirectional (look both ways) |
| Can fix an earlier token? | no, already committed | yes, re-evaluated each pass |
| Reported decode speed | baseline | up to ~4x faster, 1000+ tok/s on H100 (Google, reported) |
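The "Attention" row comes down to which positions each slot is allowed to see. A minimal sketch of the two mask shapes for a 4-slot block follows; the helper names are made up for illustration, and how DiffusionGemma masks the prompt or earlier blocks is not covered by the announcement summary above.

```python
def causal_mask(n):
    """Slot i may attend to slot j only when j <= i (look backward only)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every slot may attend to every slot in the block."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as text: 'x' = visible, '.' = hidden."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))
    # x...
    # xx..
    # xxx.
    # xxxx
    print(render(bidirectional_mask(4)))
    # xxxx
    # xxxx
    # xxxx
    # xxxx
```

In the causal mask the first row sees only itself, which is why an early token can never use later context. In the bidirectional mask every row is full.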

Why does generating in blocks win? Run the numbers on a 512-token answer (illustrative). The autoregressive printer needs 512 forward passes, one per token, each stalled waiting on the previous one, which is exactly why decode is the latency-bound, memory-bandwidth-bound phase of LLM inference. DiffusionGemma instead lays those 512 tokens down as two blocks of 256 and refines each over a handful of denoising passes: say ~16 passes total (illustrative; Google reports the speedup, not the pass count). That collapses hundreds of strictly sequential steps into a few parallel ones, and a parallel-friendly pass keeps the GPU busy, which is where the reported up-to-4x decode speedup and 1000+ tokens/sec on an H100 come from.
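The pass-count arithmetic can be written out directly. The 8 passes per block below is an assumption chosen to match the illustrative total of ~16; it is not a figure Google has published.

```python
import math


def autoregressive_passes(num_tokens):
    """One forward pass per generated token."""
    return num_tokens


def block_passes(num_tokens, block_size, passes_per_block):
    """Blocks needed (rounded up) times denoising passes per block."""
    return math.ceil(num_tokens / block_size) * passes_per_block


ar = autoregressive_passes(512)
bd = block_passes(512, block_size=256, passes_per_block=8)  # 8 is assumed

print(ar, bd, ar / bd)  # 512 16 32.0
```

The 32x drop is in sequential steps, not wall-clock time. Each denoising pass does more work than a single autoregressive step, which is why the reported end-to-end gain is up to ~4x rather than 32x.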

The catch is that each denoising pass is heavier than a single autoregressive step. Bidirectional attention re-reads the whole block every pass, so it can't reuse a backward-only KV cache the way a causal decoder does, and the headline 4x is measured on dedicated GPUs where that parallel work has lanes to fill. DiffusionGemma offsets the cost with a mixture-of-experts design (26B total parameters but only ~3.8B active per token) and fits in 18 GB of VRAM when quantized, so it still runs on a high-end consumer GPU such as the RTX 5090 that Google benchmarks. The payoff is a different shape of LLM: not a faster printer, but a model that drafts a paragraph all at once and sharpens it, a live, open-weight alternative to left-to-right decoding.
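A quick back-of-the-envelope check on those sizes, as a sketch under stated assumptions: "GB" is taken as 10^9 bytes, and the whole 18 GB is treated as weights, which gives only a ceiling on bits per parameter because real memory use also includes activations and runtime overhead. The announcement summary does not name the quantization format.

```python
TOTAL_PARAMS = 26e9    # total parameters (reported)
ACTIVE_PARAMS = 3.8e9  # approximate active parameters per token (reported)
VRAM_BYTES = 18e9      # quantized footprint; assumes 1 GB = 10**9 bytes

# Share of the weights that do work for any one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Ceiling on average bits per parameter if all 18 GB were weights.
bits_per_param_ceiling = VRAM_BYTES * 8 / TOTAL_PARAMS

# For comparison: size of the same weights at 16 bits (2 bytes) each.
sixteen_bit_gb = TOTAL_PARAMS * 2 / 1e9

print(f"active fraction: {active_fraction:.1%}")
print(f"bits per parameter (ceiling): {bits_per_param_ceiling:.2f}")
print(f"16-bit weights: {sixteen_bit_gb:.0f} GB")
```

Roughly one parameter in seven is active per token, and at 16 bits the weights alone would need about 52 GB, which is why the 18 GB figure depends on quantization.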


Frequently Asked Questions