dMoE — block-level expert routing for diffusion MoE LLM
The news. On May 29, 2026, researchers posted the dMoE paper to arXiv. It targets a mismatch: diffusion LLMs decode a block of tokens in parallel, but Mixture-of-Experts routing picks experts per token, so a block ends up activating the union of every token's choices. dMoE pools the per-token router logits into one block-level expert distribution, shrinking unique activated experts per block from 69.5 to 14.6. The authors report 76.64–79.84% less memory, 99.11% of original quality retained, and a 1.14–1.66× end-to-end speedup. The work is in progress; code is on GitHub.
Picture a table of six friends at a restaurant. Order à la carte and each diner picks a different dish, so the kitchen has to heat the grill and the fryer and the pasta station and the sushi counter — almost every station fires to cover one table. Now order family-style: the table agrees on a few shared plates, and only two or three stations light up. The food is just as good; the kitchen does a fraction of the work. That is the whole idea behind dMoE — the diners are a block of tokens, the stations are experts, and the question is whether each token orders alone or the block shares one order.
Here is why the à la carte bill gets so big. A diffusion LLM doesn't emit one token at a time — it decodes a whole block in parallel. Standard MoE routing then runs per token: each of the block's tokens independently picks its own top-k experts. Because the tokens disagree about which experts they want, the block has to load the union of all their picks — the paper measures ~69.5 distinct experts per block, even though any single token uses only k of them. Every one of those experts' weights must be paged into fast memory before it can run, and on an MoE that paging — not the arithmetic — is what dominates. dMoE changes the order: it pools the per-token router logits into one block-level distribution, so the whole block commits to a single shared set of experts and the kitchen only heats the few stations everyone agreed on.
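The two ordering styles can be put side by side in a few lines of Python. This is a minimal sketch, not the paper's implementation: the pooling operator (a mean over per-token softmax probabilities), the shared budget of `k` experts, and the function names `per_token_experts` and `block_level_experts` are all assumptions made for illustration, and the router logits are random stand-ins rather than output from a real model.

```python
import math
import random


def top_k(scores, k):
    """Return the indices of the k largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_experts(block_logits, k):
    """Standard MoE routing: every token picks its own top-k, the block loads the union."""
    active = set()
    for logits in block_logits:
        active.update(top_k(logits, k))
    return active


def block_level_experts(block_logits, k):
    """Pooled routing (assumed mean-of-softmax): one shared top-k for the whole block."""
    num_experts = len(block_logits[0])
    probs = [softmax(logits) for logits in block_logits]
    pooled = [sum(p[e] for p in probs) / len(probs) for e in range(num_experts)]
    return set(top_k(pooled, k))


# Illustrative sizes only: 32 tokens, 128 experts, top-8 routing.
BLOCK, EXPERTS, K = 32, 128, 8
rng = random.Random(0)
block_logits = [[rng.gauss(0.0, 1.0) for _ in range(EXPERTS)] for _ in range(BLOCK)]

union = per_token_experts(block_logits, K)
shared = block_level_experts(block_logits, K)
print(f"per-token routing loads {len(union)} distinct experts")
print(f"block-level routing loads {len(shared)} distinct experts")
```

With random logits the tokens disagree far more than a trained router would, so the per-token count here overstates the gap. The structural point still holds: the union grows with the block, while the pooled set stays at whatever budget the block commits to.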
That distinction matters because, on a Mixture-of-Experts model, memory — not raw compute — caps how big a model you can run. It's the same pressure quantization attacks from a different angle: quantization shrinks each weight to fewer bits, while dMoE shrinks how many distinct experts you touch at all. The two are complementary — one makes each expert cheaper to store, the other makes the block visit fewer of them — and dMoE's lever is the one that keeps a diffusion-MoE from over-activating its way out of the memory budget.
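A back-of-the-envelope sketch shows why the two levers stack. Everything here beyond the 69.5 and 14.6 expert counts is an assumption: the 22M-parameter expert is hypothetical, 4-bit quantization is modeled as exactly 4 bits per weight with no scale or zero-point overhead, and the sketch assumes the expert counts are unaffected by quantization, which the paper does not claim.

```python
def expert_footprint_mb(num_experts, params_per_expert, bits_per_weight):
    """Size of the activated experts' weights in MB (10^6 bytes)."""
    return num_experts * params_per_expert * bits_per_weight / 8 / 1e6


PARAMS_PER_EXPERT = 22_000_000  # hypothetical expert size (44 MB at 16 bits per weight)

baseline = expert_footprint_mb(69.5, PARAMS_PER_EXPERT, 16)    # per-token routing, 16-bit
quant_only = expert_footprint_mb(69.5, PARAMS_PER_EXPERT, 4)   # fewer bits per weight
dmoe_only = expert_footprint_mb(14.6, PARAMS_PER_EXPERT, 16)   # fewer experts per block
both = expert_footprint_mb(14.6, PARAMS_PER_EXPERT, 4)         # both levers together

for label, mb in [("baseline", baseline), ("quantization only", quant_only),
                  ("dMoE only", dmoe_only), ("both", both)]:
    print(f"{label:>18}: {mb:8.1f} MB  ({baseline / mb:.1f}x smaller)")
```

Because one lever scales the bits per weight and the other scales the number of experts, their savings multiply in this simplified model rather than overlap.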
Where the memory actually goes
Take a block of 32 tokens, each routing to its top-8 of 128 experts (illustrative sizes — the paper fixes neither). That is 32 × 8 = 256 expert picks spread over a pool of 128, and because the picks overlap heavily a smaller set actually lights up — the paper measures ~69.5 distinct experts per block. Each expert's feed-forward weights — say ~44 MB (illustrative) — must be paged into HBM before it runs, so the block pays for roughly 69.5 × 44 MB ≈ 3.1 GB of weight traffic. Pool those 32 routing decisions into one block-level distribution and the block commits to ~14.6 experts: the same ~44 MB × 14.6 ≈ 640 MB, about a fifth of the traffic — the reported 77–80% cut — while quality holds at 99.11%. The arithmetic each expert does is unchanged; what shrinks is how many of them you had to wake up.
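The arithmetic above is easy to check. In this small sketch the 44 MB expert size is the illustrative figure from the paragraph, the expert counts are the paper's reported averages, and the helper name `weight_traffic_mb` is made up for the example.

```python
EXPERT_MB = 44.0            # illustrative size of one expert's feed-forward weights
PER_TOKEN_EXPERTS = 69.5    # paper-reported distinct experts per block, per-token routing
BLOCK_LEVEL_EXPERTS = 14.6  # paper-reported distinct experts per block, dMoE


def weight_traffic_mb(num_experts, expert_mb=EXPERT_MB):
    """Weight traffic for one block: every activated expert is loaded once."""
    return num_experts * expert_mb


baseline_mb = weight_traffic_mb(PER_TOKEN_EXPERTS)
pooled_mb = weight_traffic_mb(BLOCK_LEVEL_EXPERTS)
reduction = 1.0 - pooled_mb / baseline_mb

print(f"per-token  : {baseline_mb / 1000:.2f} GB")   # 3.06 GB
print(f"block-level: {pooled_mb:.0f} MB")            # 642 MB
print(f"reduction  : {reduction:.1%}")               # 79.0%
```

The reduction depends only on the ratio 14.6 / 69.5, so it comes out the same whatever expert size you plug in.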
| Routing scheme | Experts loaded / block | Expert-weight memory | Quality |
|---|---|---|---|
| Per-token top-k (standard MoE) | ~69.5 (paper) | baseline (1×) | reference |
| Block-level pooled (dMoE) | ~14.6 (paper) | −76.6 to −79.8% (paper) | 99.11% retained (paper) |
A caveat worth stating plainly: these are the paper's own reported numbers on its models, and the work is flagged in-progress. The 99.11% figure says quality barely moves on their benchmarks, not that block-level routing is free everywhere — pooling throws away some per-token specialization, and how much that costs depends on how strongly a model's tokens actually want different experts. The honest reading is that dMoE exploits a specific redundancy created by parallel block decoding, and the size of the win scales with how much over-activation your diffusion-MoE was paying for in the first place.
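One rough way to see how strongly a block's tokens want different experts is to count how many of each token's own picks survive in the shared set. The `routing_coverage` function below is a hypothetical diagnostic for illustration, not a metric from the paper, and it says nothing directly about output quality.

```python
def top_k(scores, k):
    """Return the set of indices of the k largest scores."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])


def routing_coverage(block_logits, shared_experts, k):
    """Mean fraction of each token's own top-k picks that the shared set still contains."""
    kept = [len(top_k(logits, k) & shared_experts) / k for logits in block_logits]
    return sum(kept) / len(kept)


# Two tokens that agree: sharing one set loses nothing.
agree = [[3.0, 2.0, 0.0, 0.0], [3.0, 2.0, 0.0, 0.0]]
print(routing_coverage(agree, {0, 1}, k=2))     # 1.0

# Two tokens that want disjoint experts: each keeps only half of its picks.
disagree = [[3.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 2.0]]
print(routing_coverage(disagree, {0, 2}, k=2))  # 0.5
```

Coverage near 1.0 suggests per-token routing was mostly paying for redundancy; lower coverage suggests tokens with genuinely different preferences, which is where pooling is more likely to cost something.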
Related explainers
- ZEDA — zero-output expert self-distillation — another way to cut MoE expert FLOPs, by teaching the router to skip experts entirely
- MobileMoE — DRAM-aware MoE scaling — fitting an MoE under a hard memory budget on-device, the same pressure from a scaling-law angle
- PSD — parallel speculative decoding for diffusion LLMs — the other half of the intersection: speeding up the parallel block decode itself