ZEDA — Zero-output expert self-distillation
The news. On May 19, 2026, researchers from Tsinghua C3I, Shanghai AI Lab, Frontis.AI, Kuaishou Technology, and WeChat AI released the ZEDA paper. Tested across 11 benchmarks spanning math, code, and instruction-following on Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss, outperforms the strongest dynamic-MoE baseline by 6.1 and 4.0 points on the two models respectively, and delivers a ~1.20× end-to-end inference speedup. No pre-training from scratch, no task-specific adaptation: just a router adaptation on top of an already-trained MoE.
Picture the checkpoint. Every traveler today has to walk through the full screening line — bags out, laptops out, jackets off — regardless of whether they are carrying a backpack of climbing gear or a wallet and a phone. The line is staffed for the worst case, and it pays the worst-case cost for every traveler. A normal top-K MoE behaves exactly this way: the router picks K real experts for every token, even tokens whose continuation is so predictable that nearly any expert would land the same answer.
ZEDA opens an express lane next to the existing lanes. The express lane is special in that it has no staff at all — it is a parameter-free zero-output expert that produces a zero vector and lets the token pass through. Building the lane costs essentially nothing because there are no weights to train and no compute to run. The interesting part is that the lane now exists as a normal routing target the router can score, just like the other experts in the FFN block.
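To make the express lane concrete, here is a minimal pure-Python sketch of a top-K layer whose routing targets include zero-output slots. It is an illustration under stated assumptions, not the paper's implementation: the function name `moe_layer`, the `n_zero` parameter (how many zero-output slots the router can pick), and the choice to weight outputs by un-renormalized softmax probabilities are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_scores, experts, k, n_zero):
    """Route one token to its top-k targets.

    router_scores holds one score per real expert followed by n_zero
    scores for zero-output slots (n_zero is an assumption of this sketch).
    Returns (output_vector, number_of_real_experts_actually_run).
    """
    assert len(router_scores) == len(experts) + n_zero
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(x)
    real_calls = 0
    for i in top:
        if i >= len(experts):
            continue  # zero-output slot: contributes a zero vector, no compute
        y = experts[i](x)
        real_calls += 1
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, real_calls

# Two toy "real experts" and two zero-output slots, with k = 2.
experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]
x = [1.0, 2.0]
easy_out, easy_calls = moe_layer(x, [0.1, 0.0, 3.0, 3.0], experts, k=2, n_zero=2)
hard_out, hard_calls = moe_layer(x, [3.0, 2.0, -1.0, -1.0], experts, k=2, n_zero=2)
mixed_out, mixed_calls = moe_layer(x, [3.0, -1.0, 2.0, -1.0], experts, k=2, n_zero=2)
```

In this toy setup the easy token runs no real expert, the hard token runs both, and the mixed token runs one. The saving comes entirely from the skipped calls, since the zero-output slots have no weights and do no work.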
The router is then retrained, not from scratch but by self-distillation against the original frozen MoE. The original MoE plays teacher: for each batch it produces the reference output the augmented model must match. The augmented model (student) learns when sending a token to the express lane will keep the layer's output close enough to the teacher's — and routes it there when it can. A group-level balancing loss keeps the router from collapsing onto a single lane and forces it to keep using the real experts where they actually carry the signal. Stage 1 trains only the routing; stage 2 jointly fine-tunes routing and the surviving real experts on the same objective.
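The shape of that objective can be sketched in a few lines. The functional forms below are stand-ins chosen for illustration: mean squared error for "match the teacher's output" and a squared deviation from a target zero-routing fraction for the balancing term. The names `zeda_style_loss`, `target_zero_fraction`, and `balance_weight` are hypothetical, and the paper's actual losses may differ. Only the two-stage split of trainable parameters comes from the description above.

```python
def mse(student_out, teacher_out):
    """Mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(teacher_out)

def zeda_style_loss(student_outs, teacher_outs, zero_fraction,
                    target_zero_fraction=0.5, balance_weight=0.1):
    """Distillation term plus a balancing penalty (both forms are assumptions).

    student_outs / teacher_outs: per-token layer outputs from the augmented
    model and from the frozen original MoE.
    zero_fraction: share of routing slots sent to the zero-output expert.
    """
    distill = sum(mse(s, t) for s, t in zip(student_outs, teacher_outs)) / len(teacher_outs)
    balance = (zero_fraction - target_zero_fraction) ** 2
    return distill + balance_weight * balance

def trainable_parameters(stage):
    """Which parameter groups are updated in each stage, per the text above."""
    if stage == 1:
        return {"router"}
    if stage == 2:
        return {"router", "real_experts"}
    raise ValueError("stage must be 1 or 2")

# A student that matches the teacher and hits the target fraction has zero loss.
perfect = zeda_style_loss([[1.0, 2.0]], [[1.0, 2.0]], zero_fraction=0.5)
# A student that skips everything pays both the distillation and balance terms.
collapsed = zeda_style_loss([[0.0, 0.0]], [[1.0, 2.0]], zero_fraction=1.0)
```

The point of the sketch is the tension between the two terms: sending everything to the skip lane lowers compute but raises the distillation error, so the router is pushed to skip only where the teacher's output is barely affected.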
The visible result is a router that learns a far less symmetric policy than top-K. Easy tokens — common syntactic glue, predictable continuations, repeated tokens in long generations — land on the zero-output expert. Hard tokens — points where the continuation actually depends on the expert's specialised capacity — keep going to real experts. The original MoE's expert weights are mostly untouched. What changed is the traffic pattern through the layer.
| Approach | When routing is decided | Expert FLOPs per token | Needs re-pretraining? |
|---|---|---|---|
| Static top-K MoE | Per token, with K fixed at training | Always K real experts | Yes (original pretraining) |
| Existing dynamic-MoE baselines | Learned input-dependent routing | Lower than static; method / model dependent | Yes, from scratch |
| ZEDA | Adapts a finished MoE post-hoc | Under 50% of static (over 50% eliminated, per the paper) | No, two-stage self-distillation only |
A worked example shows where the savings come from. Imagine 8 tokens passing through one MoE layer of Qwen3-30B-A3B (top-K with K=8). Before ZEDA, the router sends each token to K real experts, so the layer performs 8 tokens × K = 64 expert invocations (illustrative count). After ZEDA, the router learns to send some of the K slots of easy tokens to the zero-output expert. If half of the 64 slots end up on the skip lane (for instance, 4 easy tokens that skip all 8 slots, or every token skipping 4 of its 8), the same layer performs only 32 real-expert invocations, a 50% drop in expert FLOPs (illustrative count, in line with the paper's claim of over 50% eliminated). At full-model scale the paper reports a ~1.20× end-to-end inference speedup on Qwen3-30B-A3B and GLM-4.7-Flash and a 6.1 / 4.0 point margin over the strongest dynamic-MoE baseline across 11 benchmarks.
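The counting in this example is easy to check by hand or in code. The helper below is a hypothetical illustration (the name `real_expert_invocations` and both routing patterns are made up for this sketch, not taken from the paper): it counts how many real-expert calls one layer makes given how many of each token's K slots go to the zero-output expert.

```python
def real_expert_invocations(k, zero_slots_per_token):
    """Count real-expert calls for one layer.

    zero_slots_per_token[i] is how many of token i's k slots were routed
    to the zero-output expert (0 means the token uses k real experts).
    """
    assert all(0 <= z <= k for z in zero_slots_per_token)
    return sum(k - z for z in zero_slots_per_token)

K = 8
baseline = real_expert_invocations(K, [0] * 8)                     # plain top-K
four_skip_all = real_expert_invocations(K, [8, 8, 8, 8, 0, 0, 0, 0])  # 4 easy tokens
all_skip_half = real_expert_invocations(K, [4] * 8)                 # uniform skipping
saved = 1 - four_skip_all / baseline

print(baseline, four_skip_all, all_skip_half, saved)  # 64 32 32 0.5
```

Both patterns reach the same 50% reduction, which is why the headline number says nothing by itself about how skipping is distributed across tokens. That distribution is what the router learns.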
The pedagogical lesson is what's not in the recipe. ZEDA does not add a new trained expert module (the zero-output expert has no parameters), change attention, touch tokenisation, or pre-train anything. It changes the menu of routing destinations and re-fits the router. That's a small intervention, but it changes the serving cost profile of an already-deployed MoE: the expert compute is roughly halved, while the full serving path is ~1.20× faster in the paper. The mechanism is designed for top-K MoEs and demonstrated here on Qwen3-30B-A3B and GLM-4.7-Flash; whether it carries to other MoE architectures or routing schemes is an empirical question the paper does not answer.
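Why does halving expert compute yield only ~1.20× end to end? An Amdahl's-law style back-of-envelope calculation gives some intuition. It rests on a strong simplifying assumption that is not from the paper: that serving time is proportional to FLOPs and that only the expert portion changes. Real serving stacks are also limited by memory bandwidth and communication, so treat the result as a rough sanity check. The function names are hypothetical.

```python
def end_to_end_speedup(expert_share, expert_flops_removed):
    """Amdahl-style speedup when only the expert portion gets cheaper.

    expert_share: fraction of baseline time spent in expert FFNs (assumed).
    expert_flops_removed: fraction of expert work eliminated (e.g. 0.5).
    """
    remaining = (1.0 - expert_share) + expert_share * (1.0 - expert_flops_removed)
    return 1.0 / remaining

def implied_expert_share(speedup, expert_flops_removed):
    """Invert the formula: what expert share would explain a given speedup?"""
    return (1.0 - 1.0 / speedup) / expert_flops_removed

share = implied_expert_share(1.20, 0.5)
check = end_to_end_speedup(share, 0.5)
print(round(share, 3), round(check, 2))  # 0.333 1.2
```

Under these assumptions, a 1.20× speedup from a 50% expert-FLOP cut is what you would expect if experts accounted for about one third of baseline serving time. Attention, routing, and everything else in the serving path are untouched by ZEDA and dilute the gain.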
Related explainers
- IBM Granite 4.1 — 8B dense matches the prior 32B MoE — why MoE active-parameter counts matter for inference cost in the first place.
- NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — a recent example of the 30B-A3B sparsity pattern ZEDA is pruning further.
- CoPD paper — Co-evolving Policy Distillation between parallel experts — a different distillation mechanism that runs between experts rather than between routing decisions.