Complete-muE — Two-bridge muTransfer for MoE
The news. On May 22, 2026, Peng et al. posted Complete-muE on arXiv: a two-bridge framework for hyperparameter transfer between dense FFNs and Mixture-of-Experts (MoE) layers. The paper reports that pretraining experiments on both language and diffusion models show only small hyperparameter drift across the bridges. The practical claim is that a team can tune once on a dense reference, and the same learning rate, initialization, and warmup settings then transfer near-optimally to whichever MoE configuration in the family it chooses downstream.
Picture a chef who has spent a month perfecting one master recipe for a single kitchen — burners arranged a particular way, one big pot, a fixed mise en place. Hand them a kitchen with eight smaller burners and four pots running in parallel, and the recipe needs to change: the same butter quantity will burn at the new heat profile, the same simmer time will overshoot, the same salt amount will dominate the smaller portions. The chef has two choices. Either retest every ingredient from scratch in the new kitchen — expensive — or carry two conversion charts that translate the master recipe to whatever burner-and-pot configuration the next kitchen happens to have. Complete-muE is the second option for MoE training.
The mechanism is straightforward once the metaphor sits still. The team perfects the recipe on a dense FFN (one big "pot" of width W) using ordinary muP: learning rate, initialization, and warmup are all swept on a small dense model and then transferred to a larger dense one. So far this is the standard muTransfer playbook teams already run. The new contribution is what happens when the architecture itself changes, that is, when the next training run swaps the dense FFN for an MoE layer with E experts and top-k routing.
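For readers who have not run the dense-to-dense step, here is a minimal sketch of what "transfer" means for hidden-layer weights. It assumes the commonly cited muP rules for Adam (hidden-weight learning rate proportional to 1/width, hidden-weight init variance proportional to 1/fan_in). The function names are hypothetical, the rules for input and output layers are omitted, and none of this is the Complete-muE bridge math.

```python
def mup_adam_hidden_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Rescale a hidden-layer Adam learning rate tuned at base_width.

    Assumption: the commonly cited muP rule for Adam, where the learning
    rate of hidden (matrix-like) weights scales like 1 / width.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / target_width


def mup_hidden_init_std(base_std: float, base_width: int, target_width: int) -> float:
    """Rescale a hidden-layer init std so that variance scales like 1 / fan_in."""
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return base_std * (base_width / target_width) ** 0.5


# Illustrative numbers only: tuned at width 256, transferred to width 1024.
lr_large = mup_adam_hidden_lr(base_lr=1e-2, base_width=256, target_width=1024)
std_large = mup_hidden_init_std(base_std=0.02, base_width=256, target_width=1024)
print(lr_large, std_large)
```

The point of the sketch is that the small-model sweep produces one set of base values, and the large model's values follow from a closed-form rescaling rather than a second sweep.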
The first bridge, active-width scaling, handles the case where the MoE's per-token activated width differs from the dense width. An MoE with 8 experts of width W/8 at top-2 activates 2 × W/8 = W/4 per token, a quarter of the dense reference's FFN width and so roughly a quarter of its per-token FFN compute. Plain muTransfer is built to handle width changes within one family, so the active-width bridge applies that same width-invariance idea across the architecture boundary from dense to sparse-expert.
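The active-width arithmetic can be checked in a few lines. This is a sketch with a hypothetical helper name; widths are expressed as exact fractions of the dense width W so no rounding creeps in.

```python
from fractions import Fraction


def active_width(num_experts: int, top_k: int, expert_width: Fraction) -> Fraction:
    """Per-token activated FFN width: top_k experts, each of expert_width.

    expert_width is given as a fraction of the dense reference width W.
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    return top_k * expert_width


# The example from the text: 8 experts of width W/8, top-2 routing.
print(active_width(num_experts=8, top_k=2, expert_width=Fraction(1, 8)))  # prints 1/4
```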
The second bridge, activated-expert scaling, is the genuinely new piece. Hold active width constant — say, both configs run at active width W/4 — but vary the expert count E and top-k. An MoE at (E=8, k=2, expert width W/8) and one at (E=16, k=4, expert width W/16) both land at active width W/4, yet they behave differently because the router's load balance, the per-expert effective batch size, and the gradient signal each expert sees all shift with E and k. Activated-expert scaling captures that shift as a parameter-space mapping, so hyperparameters tuned at (E=8, k=2) transfer cleanly to (E=16, k=4) without a fresh sweep. One dense tune plus two bridges covers any MoE in the family.
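To see which configurations differ only along the axis the second bridge covers, one can group candidate MoE layouts by active width. The sketch below uses a hypothetical `MoEConfig` type. The first two configs are the ones from the text, and the third is an invented extra added for contrast.

```python
from collections import defaultdict
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MoEConfig:
    num_experts: int
    top_k: int
    expert_width: Fraction  # as a fraction of the dense width W

    @property
    def active_width(self) -> Fraction:
        # Per-token activated width: top_k experts of expert_width each.
        return self.top_k * self.expert_width


configs = [
    MoEConfig(num_experts=8, top_k=2, expert_width=Fraction(1, 8)),
    MoEConfig(num_experts=16, top_k=4, expert_width=Fraction(1, 16)),
    MoEConfig(num_experts=8, top_k=1, expert_width=Fraction(1, 8)),  # invented extra
]

# Configs in the same group share an active width and differ only in (E, k).
groups = defaultdict(list)
for cfg in configs:
    groups[cfg.active_width].append(cfg)

for width, members in sorted(groups.items()):
    print(f"active width {width} of W: {[(m.num_experts, m.top_k) for m in members]}")
```

In the article's terms, members of one group differ only in expert count and top-k, which is the gap activated-expert scaling is meant to cover. Moving between groups, or from the dense reference into a group, is the active-width bridge's job.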
Where Complete-muE earns its keep
Hold three things fixed and walk the arithmetic. One team. One trillion-parameter MoE training budget. Three architecture candidates under consideration: a dense 1T baseline, an MoE-A at (E=8, k=2), and an MoE-B at (E=16, k=4). The team needs to pick the best learning rate η for each candidate before committing to the full run. Without Complete-muE, a defensible plan is three independent muTransfer studies: tune η on small-dense and transfer to dense-1T, tune η on small-MoE-A and transfer to MoE-A-1T, tune η on small-MoE-B and transfer to MoE-B-1T. Each study sweeps roughly 8 candidate η values at the small scale (illustrative; the exact sweep size depends on the team's prior). Total: 3 sweeps × 8 candidates = 24 small-scale runs.
With Complete-muE, the team runs one small-dense sweep (~8 candidate η values), then applies the two bridges to derive η for both MoE candidates. Total: 1 sweep × 8 candidates = 8 small-scale runs. The bridges themselves are cheap: they are closed-form parameter-space mappings, not extra runs. The saving is roughly 3× in small-sweep compute for the cost of writing the bridges down once. The paper's headline experimental result is that the bridged hyperparameters show only minor drift from a directly tuned sweep on both the language-model and the diffusion-model pretraining experiments the authors report, which is the same bar muTransfer itself targets.
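The run-count comparison above is simple enough to write down directly. This sketch uses hypothetical function names and the article's illustrative sweep size of 8, and it treats bridge application as free because the bridges are described as closed-form.

```python
def small_runs_separate(num_architectures: int, candidates_per_sweep: int) -> int:
    """One independent small-scale sweep per architecture."""
    return num_architectures * candidates_per_sweep


def small_runs_bridged(candidates_per_sweep: int) -> int:
    """A single small dense sweep; MoE settings are derived, not swept."""
    return candidates_per_sweep


separate = small_runs_separate(num_architectures=3, candidates_per_sweep=8)
bridged = small_runs_bridged(candidates_per_sweep=8)
print(separate, bridged, separate / bridged)  # prints 24 8 3.0
```

The ratio equals the number of architecture candidates, so the saving grows with every extra MoE layout the team wants to compare.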
| Setting | Sweeps run (or reported result) | Comment |
|---|---|---|
| Plain muTransfer, dense-only | 1× small dense sweep → dense-1T | baseline — only the dense candidate is covered |
| Plain muTransfer, three architectures separately | 3× small sweeps (dense + MoE-A + MoE-B) | ~3× cost; nothing transfers across architecture (illustrative) |
| Complete-muE, three architectures | 1× small dense sweep + 2 bridge applications | bridges are closed-form, ~free (setup-dependent) |
| Bridged-vs-direct η drift | minor drift, language + diffusion experiments | same bar muTransfer itself targets (reported by Peng et al.) |
A small caveat: Complete-muE is a transfer claim, not a guarantee that any single set of η values is universally optimal. The bridges are derived for the muP family: they assume the team has already adopted muP-style parameterization across the dense and MoE codepaths, and that the small dense reference is itself well-tuned. Teams running non-muP setups (default PyTorch init, fixed Adam β values, etc.) get no benefit. The paper is careful about this scope; the framing here matches it.
The deeper lesson is what the bridges make legible: a dense → MoE architecture change can be expressed as two specific parameter-space transformations rather than as "everything is different now." Other architectural changes, such as grouped attention variants or hybrid dense-sparse layers, might admit similar bridges once someone writes them down.
Related explainers
- MSSP — Scale-stable parameterization beyond muP — the alternative-parameterization line of work that Complete-muE explicitly contrasts with
- IBM Granite 4.1 — 8B dense matches the prior 32B MoE — concrete dense-vs-MoE compute tradeoff this paper aims to make cheaper to navigate
- TFGN — Subspace-preserving updates for continual pre-training — another "carry hyperparameters across an architectural change" line of work