ZEDA — Zero-Expert Self-Distillation Adaptation — is a post-training recipe from Tsinghua C3I, Shanghai AI Lab, Frontis.AI, Kuaishou, and WeChat AI that converts a fully trained static Mixture-of-Experts model into a dynamic one. It injects parameter-free zero-output experts as new routing targets and uses two-stage self-distillation against the original frozen MoE to teach the router when to use them. The paper reports over 50% of expert FLOPs eliminated at marginal accuracy loss and a ~1.20× end-to-end inference speedup on Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, plus a 6.1 / 4.0 point margin over the strongest prior dynamic-MoE baseline.

Why use a zero-output expert instead of a real but cheaper expert?

Because a zero-output expert has no parameters to learn and no compute to run: routing a token to it is mathematically equivalent to skipping the expert. A cheaper real expert would still consume FLOPs and would have to be trained, which defeats ZEDA's post-hoc, no-re-pretraining premise. The zero-output lane is the smallest intervention that lets the router choose to do nothing without breaking the routing abstraction, so in principle it can be added to any top-K MoE, subject to implementation details.

How is ZEDA different from CoPD or other expert-distillation methods?

CoPD distils between parallel experts while they co-evolve, so the experts are the students and the distillation changes what each expert can do. ZEDA distils between two copies of the same MoE (the original frozen MoE as teacher, and a copy augmented with zero-output experts as student), and the routing decision is the main thing being learned. The real experts' weights are frozen in stage 1, where only the router trains, and fine-tuned in stage 2. CoPD changes the experts; ZEDA changes how often the layer uses experts at all.

ZEDA paper — Zero-output expert self-distillation

learnaivisually.com/ai-explained/zeda-zero-output-expert-distill

TL;DR

What is it: The ZEDA paper (Tsinghua C3I, Shanghai AI Lab, Frontis.AI, Kuaishou, WeChat AI) presents Zero-Expert Self-Distillation Adaptation — a recipe that converts an already-trained static Mixture-of-Experts model into a dynamic one by injecting parameter-free zero-output experts and adapting the router via self-distillation.
Why it’s needed: Big MoE models route every token to K real experts even when the work is trivial. ZEDA eliminates over half of expert FLOPs at marginal accuracy loss and delivers a ~1.20× end-to-end inference speedup on Qwen3-30B-A3B and GLM-4.7-Flash without re-pretraining.
vs previous: Earlier dynamic-MoE work trained input-dependent routing from scratch, while ZEDA adapts a finished MoE post-hoc and still beats the strongest prior dynamic-MoE baseline by 6.1 / 4.0 points on Qwen3-30B-A3B and GLM-4.7-Flash.

Jargon

MoE: Mixture of Experts — a transformer layer where the dense feed-forward network is replaced by N parallel sub-FFNs ("experts") and a small router that picks the top-K for each token. The parameter count is large but the active compute per token is only K of N. Qwen3-30B-A3B has 30B total parameters and ~3B active per token.
Expert FLOPs: The floating-point operations spent inside the MoE experts on a given token — the dominant cost of an MoE forward pass at inference. ZEDA targets this number directly: skipping an expert means skipping its FLOPs.
Zero-output expert: A parameter-free "expert" inserted alongside the existing N experts. It has no weights to learn and no compute to run — routing a token to it is mathematically equivalent to skipping that expert for that token, but it lives inside the same router decision and so behaves like a normal routing target.
Self-distillation: A distillation setup where the teacher and student are the same model architecture, just at different stages. ZEDA freezes the original MoE as teacher and lets a copy augmented with zero-output experts learn (as student) to match the teacher's outputs while learning when it can use the skip-lane.
Two-stage self-distillation: ZEDA's training procedure. Stage 1 trains only the router on the augmented graph so it learns to use the new zero-output experts. Stage 2 jointly fine-tunes the routing and the surviving real experts on the same self-distillation objective. Stages are kept separate so routing doesn't collide with expert updates.
Group-level balancing loss: An auxiliary loss that keeps routing from collapsing — pushing tokens to spread across experts at a group granularity (not per-token) so the model still uses its capacity instead of funneling everything to the cheapest expert or only to the skip-lane.
Top-K routing: The default MoE routing scheme: for each token the router scores every expert and sends the token to the top K. K is small relative to N (K=1 or K=2 in many earlier MoEs, K=8 in Qwen3-30B-A3B). Cost per token is fixed at K experts regardless of whether the token needs that much capacity.

The news. On May 19, 2026, researchers from Tsinghua C3I, Shanghai AI Lab, Frontis.AI, Kuaishou Technology, and WeChat AI released the ZEDA paper. Tested across 11 benchmarks spanning math, code, and instruction-following on Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss, outperforms the strongest dynamic-MoE baseline by 6.1 and 4.0 points respectively, and delivers a ~1.20× end-to-end inference speedup. No pre-training from scratch, no task-specific adaptation: just a two-stage self-distillation pass on top of an already-trained MoE.

Picture the checkpoint. Every traveler today has to walk through the full screening line — bags out, laptops out, jackets off — regardless of whether they are carrying a backpack of climbing gear or a wallet and a phone. The line is staffed for the worst case, and it pays the worst-case cost for every traveler. A normal top-K MoE behaves exactly this way: the router picks K real experts for every token, even tokens whose continuation is so predictable that nearly any expert would land the same answer.

ZEDA opens an express lane next to the existing lanes. The express lane is special in that it has no staff at all — it is a parameter-free zero-output expert that produces a zero vector and lets the token pass through. Building the lane costs essentially nothing because there are no weights to train and no compute to run. The interesting part is that the lane now exists as a normal routing target the router can score, just like the other experts in the FFN block.
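To make the express lane concrete, here is a minimal pure-Python sketch of a top-K layer whose routing menu includes zero-output experts. It is an illustration under stated assumptions, not the paper's implementation: `moe_forward`, `make_scaling_expert`, the toy "experts" that merely scale the input, and the choice to weight outputs by un-renormalised softmax gates are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of router scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_scaling_expert(factor):
    # Toy stand-in for a real expert FFN: it just scales the input vector.
    return lambda x: [factor * v for v in x]

def moe_forward(x, router_logits, real_experts, num_zero_experts, k):
    """Route one token through the top-k of (real + zero-output) experts.

    router_logits holds one score per real expert, followed by one score
    per zero-output expert. Returns (output_vector, real_expert_calls).
    """
    assert len(router_logits) == len(real_experts) + num_zero_experts
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(x)
    real_calls = 0
    for i in top:
        if i >= len(real_experts):
            # Zero-output expert: contributes a zero vector, so nothing runs.
            continue
        y = real_experts[i](x)
        real_calls += 1
        # Assumption: gate weights are used as-is, without renormalisation.
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, real_calls

experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0)]
x = [1.0, -1.0]
hard_logits = [2.0, 1.5, 0.1, 0.0, -3.0, -3.0]    # zero-output experts score low
easy_logits = [2.0, 0.1, 0.0, -1.0, 3.0, -3.0]    # one zero-output expert wins a slot
skip_logits = [-5.0, -5.0, -5.0, -5.0, 3.0, 2.0]  # both slots go to the skip-lane

_, hard_calls = moe_forward(x, hard_logits, experts, num_zero_experts=2, k=2)
_, easy_calls = moe_forward(x, easy_logits, experts, num_zero_experts=2, k=2)
skip_out, skip_calls = moe_forward(x, skip_logits, experts, num_zero_experts=2, k=2)
print(hard_calls, easy_calls, skip_calls)
```

With these toy scores the hard token runs two real experts, the easy token runs one, and the fully skipped token runs none and contributes a zero vector. The router code path is identical in all three cases, which is the point of treating the skip as an ordinary routing target.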

The router is then retrained, not from scratch but by self-distillation against the original frozen MoE. The original MoE plays teacher: for each batch it produces the reference output the augmented model must match. The augmented model (student) learns when sending a token to the express lane will keep the layer's output close enough to the teacher's — and routes it there when it can. A group-level balancing loss keeps the router from collapsing onto a single lane and forces it to keep using the real experts where they actually carry the signal. Stage 1 trains only the routing; stage 2 jointly fine-tunes routing and the surviving real experts on the same objective.
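The training setup described above can be sketched as a distillation term plus a balancing term, with a per-stage list of what is allowed to update. Everything specific here is an assumption for illustration: the article does not give the paper's loss formulas, so the KL divergence on output logits, the Switch-Transformer-style balancing penalty computed over one group of tokens, the `balance_weight` value, and the names `student_loss` and `trainable_parts` are hypothetical stand-ins.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    # KL(teacher || student): zero when the student reproduces the teacher.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def balance_penalty(group_router_probs, group_assignments, num_experts):
    """Stand-in balancing term over one group of tokens (not the paper's formula).

    group_router_probs: one router probability vector per token.
    group_assignments:  one list of chosen expert indices per token.
    Experts are indexed over real and zero-output experts together.
    """
    n_tokens = len(group_assignments)
    load = [0.0] * num_experts
    for chosen in group_assignments:
        for i in chosen:
            load[i] += 1.0
    total_slots = sum(load)
    frac = [l / total_slots for l in load]
    mean_prob = [sum(p[i] for p in group_router_probs) / n_tokens
                 for i in range(num_experts)]
    # Equals 1.0 for perfectly uniform routing and grows as routing collapses.
    return num_experts * sum(f * m for f, m in zip(frac, mean_prob))

def trainable_parts(stage):
    # Stage 1 updates the router only; stage 2 also updates the real experts.
    if stage == 1:
        return {"router"}
    if stage == 2:
        return {"router", "real_experts"}
    raise ValueError("stage must be 1 or 2")

def student_loss(teacher_logits, student_logits, group_router_probs,
                 group_assignments, num_experts, balance_weight=0.01):
    distill = kl_divergence(teacher_logits, student_logits)
    balance = balance_penalty(group_router_probs, group_assignments, num_experts)
    return distill + balance_weight * balance

uniform = balance_penalty([[0.25] * 4] * 4, [[0], [1], [2], [3]], 4)
collapsed = balance_penalty([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, 4)
print(uniform, collapsed)  # collapsed routing is penalised more than uniform
```

The design point the sketch tries to capture is the split of responsibilities: the distillation term rewards skipping only where the output stays close to the teacher's, while the balancing term discourages the degenerate solution of sending everything to one destination.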

The visible result is a router that learns a far less symmetric policy than top-K. Easy tokens — common syntactic glue, predictable continuations, repeated tokens in long generations — land on the zero-output expert. Hard tokens — points where the continuation actually depends on the expert's specialised capacity — keep going to real experts. The original MoE's expert weights are mostly untouched. What changed is the traffic pattern through the layer.

Approach	When routing is decided	Expert FLOPs per token	Needs re-pretraining?
Static top-K MoE	Frozen at training	Always K real experts	Yes (original pretraining)
Existing dynamic-MoE baselines	Learned input-dependent routing	Lower than static; method / model dependent	Yes, from scratch
ZEDA	Adapts a finished MoE post-hoc	Under 50% of static (over 50% eliminated, per the paper)	No; two-stage self-distillation only

A worked-example sense of where the savings come from. Imagine 8 tokens passing through one MoE layer of Qwen3-30B-A3B (top-K with K=8). Before ZEDA, the router sends each token to K real experts, so the layer performs K × 8 = 64 real-expert invocations (illustrative count). After ZEDA, the router sends some of each token's K slots to the zero-output expert: most of them for easy tokens, few or none for hard ones. If that averages out to half of all slots across the 8 tokens, the same layer performs only about 32 real-expert invocations, a 50% drop in expert FLOPs (illustrative count, chosen to match the paper's "50%+ eliminated" claim). At full-model scale the paper reports a ~1.20× end-to-end inference speedup on Qwen3-30B-A3B and GLM-4.7-Flash and a 6.1 / 4.0 point margin over the strongest dynamic-MoE baseline across 11 benchmarks.
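The slot arithmetic above can be checked in a few lines. This is only a counting sketch: `real_expert_invocations` is a hypothetical helper, and the per-token skip counts are made-up illustrative numbers, not measurements from the paper.

```python
def real_expert_invocations(zero_slots_per_token, k):
    # Each token has k routing slots; slots sent to the zero-output
    # expert cost nothing, the rest each invoke one real expert.
    return sum(k - z for z in zero_slots_per_token)

K = 8
before = real_expert_invocations([0] * 8, K)                     # no skip-lane
after_even = real_expert_invocations([4] * 8, K)                 # every token skips half
after_skewed = real_expert_invocations([8, 8, 6, 6, 2, 2, 0, 0], K)  # easy tokens skip more

print(before, after_even, after_skewed)  # 64 32 32
```

The even and skewed cases land on the same total, which shows that the headline saving is an average over tokens: the router is free to spend the budget unevenly, skipping heavily on easy tokens and not at all on hard ones.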

The pedagogical lesson is what's not in the recipe. ZEDA does not add a new expert module, change attention, touch tokenisation, or pre-train anything. It changes the menu of routing destinations and re-fits the router. That's a small intervention, but it changes the serving cost profile of an already-deployed MoE — the expert subcompute is roughly halved, while the full serving path is ~1.20× faster in the paper. The mechanism is designed for top-K MoEs and demonstrated here on Qwen3-30B-A3B and GLM-4.7-Flash; whether it carries to other MoE architectures or routing schemes is an empirical question the paper does not answer.
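One way to see why halving expert compute yields roughly 1.20× rather than 2× is an Amdahl's-law estimate. This is a back-of-envelope sketch, not a figure from the paper: it assumes runtime is proportional to FLOPs and ignores routing overhead and memory-bound effects, and both function names are hypothetical.

```python
def end_to_end_speedup(expert_time_fraction, expert_flops_kept):
    # Amdahl's law: only the expert share of runtime shrinks.
    remaining = (1.0 - expert_time_fraction) + expert_time_fraction * expert_flops_kept
    return 1.0 / remaining

def implied_expert_fraction(speedup, expert_flops_kept):
    # Invert the formula above: which expert share of runtime would
    # produce this speedup for this fraction of expert FLOPs kept?
    return (1.0 - 1.0 / speedup) / (1.0 - expert_flops_kept)

share = implied_expert_fraction(speedup=1.20, expert_flops_kept=0.5)
print(round(share, 3))                           # 0.333
print(round(end_to_end_speedup(share, 0.5), 2))  # 1.2
```

Under these simplifying assumptions, a 1.20× speedup from keeping half the expert FLOPs is consistent with experts accounting for about a third of end-to-end time, with attention, routing, and other overhead making up the rest.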

Goes deeper in: LLM Internals → Transformer Block → FFN

Related explainers

IBM Granite 4.1 — 8B dense matches the prior 32B MoE — why MoE active-parameter counts matter for inference cost in the first place.
NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — a recent example of the 30B-A3B sparsity pattern ZEDA is pruning further.
CoPD paper — Co-evolving Policy Distillation between parallel experts — a different distillation mechanism that runs between experts rather than between routing decisions.
