What is Complete-muE?

Complete-muE is a two-bridge extension of muTransfer (the practical workflow built on muP) that lets a single dense-FFN hyperparameter sweep transfer near-optimally to any Mixture-of-Experts configuration. The two bridges are active-width scaling, which maps dense hyperparameters to an MoE layer with a different per-token activated width, and activated-expert scaling, which carries hyperparameters between MoE shapes (different expert counts E and top-k values). The paper (Peng et al., May 2026) reports small hyperparameter drift across the bridges in both language-model and diffusion-model pretraining experiments.

Why does Complete-muE matter for large-scale training?

Hyperparameter sweeps are among the most expensive line items in trillion-parameter pretraining budgets. Plain muTransfer already saves teams from re-sweeping when they scale a model's width within one architecture family; what Complete-muE adds is that the transfer survives the architectural switch from dense to sparse-expert. Teams that move from a dense baseline to an MoE mid-roadmap, or that want to compare several MoE shapes before committing, no longer have to run a separate small-model sweep for each candidate. One dense tune plus the two bridges covers the whole space.

How does Complete-muE differ from MSSP?

MSSP proposes an alternative parameterization to muP: it changes how the model is initialized and scaled, displacing muP rather than extending it. Complete-muE keeps muP and adds two parameter-space bridges on top, so it composes with the muTransfer playbook teams are already running. The two papers attack the same gap (muP's silence about MoE) from opposite directions: MSSP rebuilds the foundation, while Complete-muE adds adapters on top of it.

Complete-muE paper — Two-bridge muTransfer for MoE



TL;DR

What is it: The Complete-muE paper proposes a two-bridge extension of muTransfer that lets a single dense-FFN hyperparameter sweep transfer near-optimally to any Mixture-of-Experts configuration — different expert counts, different top-k routing, different activated widths. The two bridges are active-width scaling (dense ↔ activated width of an MoE layer) and activated-expert scaling (across MoE shapes).
Why it’s needed: Hyperparameter sweeps at trillion-parameter scale are among the most expensive parts of pretraining. muP/muTransfer already let teams tune a small model and transfer to a larger one of the same shape; what Complete-muE adds is that the transfer survives an architectural change from dense to sparse-expert, not just a width change. Teams that move from dense to MoE mid-roadmap no longer have to re-sweep from scratch.
vs previous: Plain muTransfer only transfers across width within one architecture family, so it breaks when you swap the FFN for an MoE layer with top-k routing. MSSP (a recent alternative) replaces muP with a different parameterization; Complete-muE instead extends muP by adding two parameter-space bridges that map dense hyperparameters to the equivalent MoE hyperparameters without retuning.

Jargon

muP: Maximal Update Parameterization, a way to set initialization and per-layer learning-rate scaling so that hyperparameters that are optimal at width W stay close to optimal at width 2W, 4W, and so on. Without muP, the optimal learning rate drifts as you scale; with muP, it stays roughly constant.
muTransfer: The practical workflow muP enables — tune learning rate, init scale, and a handful of other hyperparameters on a small model, then transfer them unchanged to a large model in the same family. Saves the expensive large-model sweep.
Active width: In an MoE layer, the per-token width the model actually computes — typically expert width × top-k. An MoE with 16 experts of width W/4 at top-4 has an active width of W; with top-2, an active width of W/2.
Top-k routing: The router that picks the k experts per token the MoE layer will run. k=1 is one expert per token (e.g. Switch Transformer); k=2 or k=4 is more common in modern MoEs. Higher k means more active compute per token but more capacity per token too.
Active-width scaling bridge: Complete-muE's first bridge. Maps the dense FFN's hyperparameters to the equivalent hyperparameters for an MoE layer whose per-token active width differs from the dense width. Effectively a width adjustment within the muP framework, applied across architectures.
Activated-expert scaling bridge: Complete-muE's second bridge. Holds active width constant and carries hyperparameters between MoE shapes with different expert counts E and different top-k. Captures the part of the gap that muP plus active-width scaling alone cannot.
MoE: Mixture-of-Experts — a layer that replaces a single FFN with E parallel FFN "experts" plus a router that sends each token to the top-k of them. Total parameters grow with E; active compute per token grows only with k. See also IBM Granite 4.1's 8B-dense ≈ 32B-MoE for the dense/MoE parameter-vs-compute tradeoff.
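
To make the top-k routing and MoE entries above concrete, here is a minimal pure-Python sketch of one token passing through an MoE layer. It is an illustration under stated assumptions, not the router used in the Complete-muE paper: the function names (top_k_route, moe_forward), the toy scalar "experts", the made-up router scores, and the choice to softmax over only the selected logits are all hypothetical.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, which is one
    common choice; other routers normalize differently.
    """
    # Indices of the k largest logits, highest first (ties keep index order).
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the k routed experts and mix their outputs by router weight."""
    routed = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in routed)

# Hypothetical toy experts: E = 4 scalar functions standing in for FFNs.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

routed = top_k_route(logits, k=2)
print([i for i, _ in routed])                  # indices of the routed experts
print(moe_forward(1.0, experts, logits, k=2))  # mixed output of those experts
```

All four experts exist as parameters, but only two of them run for this token, which is the parameter-versus-compute split the MoE entry describes.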

The news. On May 22, 2026, Peng et al. posted Complete-muE on arXiv — a two-bridge framework for hyperparameter transfer between dense FFNs and Mixture-of-Experts layers. The paper reports that pretraining experiments on both language and diffusion models confirm small hyperparameter drift across the bridges: tune once on a dense reference, and the same learning rate, initialization, and warmup settings transfer near-optimally to any MoE configuration the team might choose downstream.

Picture a chef who has spent a month perfecting one master recipe for a single kitchen — burners arranged a particular way, one big pot, a fixed mise en place. Hand them a kitchen with eight smaller burners and four pots running in parallel, and the recipe needs to change: the same butter quantity will burn at the new heat profile, the same simmer time will overshoot, the same salt amount will dominate the smaller portions. The chef has two choices. Either retest every ingredient from scratch in the new kitchen — expensive — or carry two conversion charts that translate the master recipe to whatever burner-and-pot configuration the next kitchen happens to have. Complete-muE is the second option for MoE training.

The mechanism is straightforward once the metaphor sits still. The team perfects the recipe on a dense FFN (one big "pot" of width W) using ordinary muP: learning rate, initialization, and warmup are all swept on a small dense model and transferred to a larger dense one. So far this is the standard muTransfer playbook teams already run. The new contribution is what happens when the architecture itself changes, when the next training run swaps the dense FFN for an MoE layer with E experts and top-k routing.
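
The dense half of that playbook can be sketched in a few lines. This is a simplified illustration, assuming the commonly cited muP rules for hidden weight matrices trained with Adam (learning rate shrinking as 1/width, initialization standard deviation as 1/sqrt(width)); the helper name mup_hidden_hparams and the tuned values are hypothetical, and input and output layers, which follow different rules, are left out.

```python
import math

def mup_hidden_hparams(base_lr, base_init_std, base_width, width):
    """Rescale hidden-layer hyperparameters from base_width to width.

    Assumes the commonly cited muP rules for hidden matrices under Adam:
    learning rate scales as 1/width, init std as 1/sqrt(width).
    Input and output layers are not handled here.
    """
    ratio = width / base_width
    return {
        "lr": base_lr / ratio,
        "init_std": base_init_std / math.sqrt(ratio),
    }

# Hypothetical values tuned once on a small dense proxy of width 256.
tuned = {"lr": 3e-3, "init_std": 0.02}

for width in (256, 1024, 4096):
    scaled = mup_hidden_hparams(tuned["lr"], tuned["init_std"], 256, width)
    print(width, scaled)
```

The base values are what the team tunes and carries over unchanged; the width-dependent rescaling is applied by the parameterization itself, which is why the tuned learning rate stays usable as the model widens.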

The first bridge, active-width scaling, handles the case where the MoE's per-token activated width differs from the dense width. An MoE with 8 experts of width W/8 at top-2 has an active width of W/4 — a quarter of the dense reference's per-token compute. Plain muTransfer is built to handle width changes within one family, so the active-width bridge applies that same width-invariance idea across the architecture boundary from dense to sparse-expert.

The second bridge, activated-expert scaling, is the genuinely new piece. Hold active width constant — say, both configs run at active width W/4 — but vary the expert count E and top-k. An MoE at (E=8, k=2, expert width W/8) and one at (E=16, k=4, expert width W/16) both land at active width W/4, yet they behave differently because the router's load balance, the per-expert effective batch size, and the gradient signal each expert sees all shift with E and k. Activated-expert scaling captures that shift as a parameter-space mapping, so hyperparameters tuned at (E=8, k=2) transfer cleanly to (E=16, k=4) without a fresh sweep. One dense tune plus two bridges covers any MoE in the family.
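
The arithmetic behind the two example shapes can be checked directly. The sketch below is only bookkeeping, not the bridge itself: describe_moe is a hypothetical helper, widths are expressed as fractions of the dense width W, and the count of distinct top-k expert subsets is included purely as one illustrative way to see that the shapes differ even though their active widths match.

```python
from fractions import Fraction
from math import comb

def describe_moe(num_experts, top_k, expert_width):
    """Summarize an MoE shape; widths are fractions of the dense width W."""
    return {
        "active_width": top_k * expert_width,        # computed per token
        "total_width": num_experts * expert_width,   # held in parameters
        "expert_subsets": comb(num_experts, top_k),  # distinct top-k choices
    }

shape_a = describe_moe(num_experts=8, top_k=2, expert_width=Fraction(1, 8))
shape_b = describe_moe(num_experts=16, top_k=4, expert_width=Fraction(1, 16))

for name, shape in (("E=8, k=2", shape_a), ("E=16, k=4", shape_b)):
    print(name, shape)
```

Both shapes report an active width of 1/4 (that is, W/4), while the routing space they expose differs, which is the kind of difference the second bridge is meant to absorb.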

Where Complete-muE earns its keep

Hold three variables fixed and walk the arithmetic. One team. One trillion-parameter MoE training budget. Three architecture candidates under consideration: a dense 1T baseline, an MoE-A at (E=8, k=2), and an MoE-B at (E=16, k=4). The team needs to pick the best learning rate η for each candidate before committing to the full run. Without Complete-muE, a defensible plan is three independent muTransfer studies: tune η on small-dense and transfer to dense-1T, tune η on small-MoE-A and transfer to MoE-A-1T, tune η on small-MoE-B and transfer to MoE-B-1T. Each study sweeps roughly 8 candidate η values at the small scale (illustrative; exact sweep size depends on the team's prior). Total: 3 sweeps × 8 candidates = 24 small runs.

With Complete-muE, the team runs one small-dense sweep (~8 candidate η values), then applies the two bridges to derive η for both MoE candidates. Total: 8 small runs. The bridges themselves are cheap: they are closed-form parameter-space mappings, not extra runs. The saving is roughly 3× in small-scale compute, assuming comparable cost per small run, for the cost of writing the bridges down once. The paper's headline experimental result is that the bridged hyperparameters show only minor drift from a directly-tuned sweep on both the language-model and the diffusion-model pretraining experiments the authors report, the same bar muTransfer itself targets.
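
The run counts in this comparison follow from a one-line calculation, sketched here with the same illustrative numbers. The helper small_runs and the constants are hypothetical, and the ratio assumes every small run costs about the same.

```python
def small_runs(num_sweeps, candidates_per_sweep):
    """Total small-scale training runs for a set of independent sweeps."""
    return num_sweeps * candidates_per_sweep

CANDIDATES = 8      # illustrative number of learning-rate values per sweep
ARCHITECTURES = 3   # dense, MoE-A, MoE-B

separate = small_runs(ARCHITECTURES, CANDIDATES)  # one sweep per architecture
bridged = small_runs(1, CANDIDATES)               # one dense sweep; bridges add no runs

print(separate, bridged, separate / bridged)
```

The ratio grows with the number of MoE candidates under comparison, since each extra candidate adds a bridge application rather than another sweep.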

Setting	Sweeps run or reported result	Comment
Plain muTransfer, dense-only	1× small dense sweep → dense-1T	baseline — only the dense candidate is covered
Plain muTransfer, three architectures separately	3× small sweeps (dense + MoE-A + MoE-B)	~3× cost; nothing transfers across architecture (illustrative)
Complete-muE, three architectures	1× small dense sweep + 2 bridge applications	bridges are closed-form, ~free (setup-dependent)
Bridged-vs-direct η drift	minor drift, language + diffusion experiments	same bar muTransfer itself targets (reported by Peng et al.)

A small caveat: Complete-muE is a transfer claim, not a guarantee that any single set of η values is universally optimal. The bridges are derived for the muP family — they assume the team has already adopted muP-style parameterization across the dense and MoE codepaths, and that the small dense reference is itself well-tuned. Teams running non-muP setups (default PyTorch init, fixed Adam β values, etc.) get no benefit. The paper is careful about this scope; the framing here matches it. The deeper lesson is what the bridges make legible: a dense → MoE architecture change can be expressed as two specific parameter-space transformations rather than as "everything is different now." Other architectural changes — for example, grouped attention variants or hybrid dense-sparse layers — might admit similar bridges once someone writes them down.


Related explainers

MSSP — Scale-stable parameterization beyond muP — the alternative-parameterization line of work that Complete-muE explicitly contrasts with
IBM Granite 4.1 — 8B dense matches the prior 32B MoE — concrete dense-vs-MoE compute tradeoff this paper aims to make cheaper to navigate
TFGN — Subspace-preserving updates for continual pre-training — another "carry hyperparameters across an architectural change" line of work
