What is block-level expert routing?

It is the routing scheme in the dMoE paper for Mixture-of-Experts models running inside a diffusion LLM. Instead of letting each token in a decoded block pick its own top-k experts independently, dMoE pools the per-token router logits into one block-level distribution, so the whole block commits to a single small set of experts. That drops unique activated experts per block from ~69.5 to ~14.6.

Why does dMoE cut memory by ~80%?

On a Mixture-of-Experts model the dominant cost is paging each distinct expert's weights into fast memory before it can run. Per-token routing across a parallel block loads the union of every token's picks (~69.5 experts), while block-level routing loads only the shared set (~14.6). Fewer distinct experts paged in means far less weight traffic — the paper reports a 76.64–79.84% reduction — with 99.11% of quality retained and a 1.14–1.66× end-to-end speedup.

How is dMoE different from standard MoE routing?

Standard MoE routing is per token: the router scores experts and selects the top-k for each token separately. That is fine for autoregressive models that emit one token at a time, but a diffusion LLM decodes a block in parallel, so independent per-token routing over-activates experts across the block. dMoE keeps the same experts and the same per-expert math; it only changes the granularity of the decision from per-token to per-block, which removes the redundant expert loads.

dMoE cuts diffusion-LLM MoE memory ~80% — block-level expert routing


Jargon

Diffusion LLM (dLLM): A language model that decodes a block of tokens in parallel with bidirectional attention and refines them over a few steps — unlike a standard autoregressive model that emits one token at a time.
MoE (Mixture of Experts): A layer whose feed-forward network is split into many expert sub-networks. A small router activates only a few experts per input, so the model has huge total capacity but runs only a slice of it each step.
Top-k routing: The router scores every expert for an input and keeps the top-k (often k=8). Only those experts run — and only their weights need to be loaded — for that input.
Expert over-activation: When independent per-token routing across a parallel block touches far more unique experts than any single token actually needs — the paper measures ~69.5 distinct experts per block where one token uses only k.
Weight paging: Moving an expert's weights into the accelerator's fast memory (HBM) before it can compute. The number of distinct experts a block touches drives this cost, and it is exactly what dMoE shrinks.
Block-level distribution: dMoE's pooled routing decision: instead of one expert distribution per token, the per-token router logits are aggregated into one distribution for the whole block, so the block commits to a single small expert set.

The news. On May 29, 2026, researchers posted the dMoE paper to arXiv. It targets a mismatch: diffusion LLMs decode a block of tokens in parallel, but Mixture-of-Experts routing picks experts per token, so a block ends up activating the union of every token's choices. dMoE pools the per-token router logits into one block-level expert distribution, shrinking unique activated experts per block from 69.5 to 14.6. The authors report 76.64–79.84% less memory, 99.11% of original quality retained, and a 1.14–1.66× end-to-end speedup. The work is in progress; code is on GitHub.

Picture a table of six friends at a restaurant. Order à la carte and each diner picks a different dish, so the kitchen has to heat the grill and the fryer and the pasta station and the sushi counter — almost every station fires to cover one table. Now order family-style: the table agrees on a few shared plates, and only two or three stations light up. The food is just as good; the kitchen does a fraction of the work. That is the whole idea behind dMoE — the diners are a block of tokens, the stations are experts, and the question is whether each token orders alone or the block shares one order.

Here is why the à la carte bill gets so big. A diffusion LLM doesn't emit one token at a time — it decodes a whole block in parallel. Standard MoE routing then runs per token: each of the block's tokens independently picks its own top-k experts. Because the tokens disagree about which experts they want, the block has to load the union of all their picks — the paper measures ~69.5 distinct experts per block, even though any single token uses only k of them. Every one of those experts' weights must be paged into fast memory before it can run, and on an MoE that paging — not the arithmetic — is what dominates. dMoE changes the order: it pools the per-token router logits into one block-level distribution, so the whole block commits to a single shared set of experts and the kitchen only heats the few stations everyone agreed on.
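To make the two ordering styles concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The sizes (32 tokens, 128 experts, k=8) are illustrative, the logits are random stand-ins for a real router, and the pooling rule (average each token's softmax probabilities, then keep a fixed `k_block` experts) is an assumption: the source says only that per-token router logits are pooled into one block-level distribution. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Return the set of indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def per_token_experts(block_logits, k):
    """Standard MoE: every token picks its own top-k; the block loads the union."""
    loaded = set()
    for logits in block_logits:
        loaded |= top_k(logits, k)
    return loaded


def block_level_experts(block_logits, k_block):
    """Pooled routing. ASSUMPTION: pool by averaging per-token softmax probabilities."""
    num_tokens = len(block_logits)
    pooled = [0.0] * len(block_logits[0])
    for logits in block_logits:
        for expert, prob in enumerate(softmax(logits)):
            pooled[expert] += prob / num_tokens
    return top_k(pooled, k_block)


random.seed(0)
NUM_TOKENS, NUM_EXPERTS, K = 32, 128, 8  # illustrative sizes, not from the paper
block_logits = [
    [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for _ in range(NUM_TOKENS)
]

union = per_token_experts(block_logits, K)
shared = block_level_experts(block_logits, k_block=K)  # k_block is a placeholder
print(f"per-token union: {len(union)} experts, block-level set: {len(shared)} experts")
```

With random logits the union comes out far larger than k, while the pooled set is exactly `k_block` experts by construction. How dMoE sizes the shared set (the paper reports ~14.6 on average) is not stated here, so `k_block` is only a placeholder.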

That distinction matters because, on a Mixture-of-Experts model, memory — not raw compute — caps how big a model you can run. It's the same pressure quantization attacks from a different angle: quantization shrinks each weight to fewer bits, while dMoE shrinks how many distinct experts you touch at all. The two are complementary — one makes each expert cheaper to store, the other makes the block visit fewer of them — and dMoE's lever is the one that keeps a diffusion-MoE from over-activating its way out of the memory budget.

Where the memory actually goes

Take a block of 32 tokens, each routing to its top-8 of 128 experts (illustrative sizes; the paper does not fix them). That is 32 × 8 = 256 expert picks spread over a pool of 128, and because the picks overlap heavily, a smaller set actually lights up: the paper measures ~69.5 distinct experts per block. Each expert's feed-forward weights, say ~44 MB (illustrative), must be paged into HBM before it runs, so the block pays for roughly 69.5 × 44 MB ≈ 3.1 GB of weight traffic. Pool those 32 routing decisions into one block-level distribution and the block commits to ~14.6 experts: 14.6 × 44 MB ≈ 640 MB, about a fifth of the traffic, in line with the reported 77–80% cut, while quality holds at 99.11%. The arithmetic each expert does is unchanged; what shrinks is how many of them you had to wake up.
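The same arithmetic as a small script, using the illustrative 44 MB per expert and the paper's measured expert counts. The function name is hypothetical, and the result describes only this toy setup, not how the paper measured memory.

```python
MB_PER_EXPERT = 44.0  # illustrative expert size, not a figure from the paper


def weight_traffic_mb(distinct_experts, mb_per_expert=MB_PER_EXPERT):
    """Weight traffic for one block: every distinct expert is paged in once."""
    return distinct_experts * mb_per_expert


per_token_mb = weight_traffic_mb(69.5)  # union of per-token picks (paper's count)
block_mb = weight_traffic_mb(14.6)      # shared block-level set (paper's count)
reduction = 1.0 - block_mb / per_token_mb

print(f"{per_token_mb:.0f} MB vs {block_mb:.0f} MB, {reduction:.1%} less")
# prints: 3058 MB vs 642 MB, 79.0% less
```

The expert size cancels out of the ratio, so the ~79% figure depends only on the two expert counts, and it lands inside the paper's reported 76.64–79.84% range.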

Routing scheme	Experts loaded / block	Expert-weight memory	Quality
Per-token top-k (standard MoE)	~69.5	baseline (1×)	reference
Block-level pooled (dMoE)	~14.6 (paper)	−76.6 to −79.8% (paper)	99.11% retained (paper)

A caveat worth stating plainly: these are the paper's own reported numbers on its models, and the work is flagged in-progress. The 99.11% figure says quality barely moves on their benchmarks, not that block-level routing is free everywhere — pooling throws away some per-token specialization, and how much that costs depends on how strongly a model's tokens actually want different experts. The honest reading is that dMoE exploits a specific redundancy created by parallel block decoding, and the size of the win scales with how much over-activation your diffusion-MoE was paying for in the first place.
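One rough way to gauge how much over-activation your own model is paying for is to compare its measured unique-expert count against two reference points: the floor of k (every token agrees) and the count that fully uncorrelated routing would give. The sketch below computes the second under an assumption that is not from the paper: each token picks k of E experts uniformly at random, independently of the others, so a given expert is missed by all T tokens with probability (1 − k/E)^T. The function name and the sizes are illustrative.

```python
def expected_distinct_uniform(num_tokens, num_experts, k):
    """Expected distinct experts if each token picked k experts uniformly at random.

    ASSUMPTION: independent, uniform routing. Real routers are correlated,
    so measured counts should fall below this value.
    """
    miss_probability = (1.0 - k / num_experts) ** num_tokens
    return num_experts * (1.0 - miss_probability)


# Illustrative sizes from this explainer, not the paper's configuration.
uniform_baseline = expected_distinct_uniform(num_tokens=32, num_experts=128, k=8)
print(f"uncorrelated baseline: {uniform_baseline:.1f} distinct experts")
```

For these sizes the baseline is about 112 experts. A measured count close to it means tokens barely agree and pooling has more to discard; a count close to k means there was little over-activation to remove in the first place.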


Related explainers

ZEDA — zero-output expert self-distillation — another way to cut MoE expert FLOPs, by teaching the router to skip experts entirely
MobileMoE — DRAM-aware MoE scaling — fitting an MoE under a hard memory budget on-device, the same pressure from a scaling-law angle
PSD — parallel speculative decoding for diffusion LLMs — the other half of the intersection: speeding up the parallel block decode itself
