MSSP — Scale-stable parameterization for MoE

[Figure: Scale a model 64×: muP's LR drifts at MoE scale, MSSP's holds. Left panel: training loss at the same LR across 4 scales (relative loss vs. relative training step), muP vs. MSSP. Right panel: final-loss spread (muP: 4 spreads; MSSP: 1 point).]
learnaivisually.com/ai-explained/mssp-vs-mup-moe-scaling

Maximally Scale-Stable Parameterization (MSSP) is the parameterization rule introduced in the paper "Maximally Scale-Stable Parameterization for Mixture-of-Experts via Dynamical Mean Field Theory." The paper's diagnostic finding motivates the extension: muP's scale-stability guarantee leaks at MoE scale because the expert-aggregation step introduces scale-dependent observables that muP doesn't cover.

The news. On May 16, 2026, Vankadara, Haas, Hayward, Bordt, and Breccia posted "Maximally Scale-Stable Parameterization for Mixture-of-Experts via Dynamical Mean Field Theory." Using Dynamical Mean Field Theory (DMFT) in the infinite-width and infinite-expert limit, they derive scale-stable initialization and learning-rate rules, collectively MSSP, for both SGD and Adam across three different MoE scaling regimes. They report that MSSP produces qualitatively distinct training dynamics from muP and maintains monotonic improvement as model dimensions grow proportionally.

Picture an oven with a single temperature dial. For years, a tested recipe-scaling rule has told you that doubling, quadrupling, or even growing the portion of one main dish tenfold needs no temperature change: bake it the same way. That rule is the kitchen's version of muP, the Maximal Update Parameterization that, in dense neural networks, lets you tune the learning rate at small width and trust it at large width. The MSSP paper's setup is that you now want to share that oven with three or four specialist side dishes (sauces, pastries, baked add-ons) that get added or grown as you scale up. Suddenly the dial that "just worked" for the main dish is wrong: the side dishes draw heat differently, and the right temperature drifts.

Translating back: the "side dishes" are MoE experts, and "drawing heat differently" is the expert-aggregation step, where each MoE layer combines the outputs of the experts its router selected. The MSSP authors apply Dynamical Mean Field Theory, a physics toolkit for the infinite-width and infinite-expert limit, to read off exactly which terms in the training dynamics depend on scale in MoE but not in dense networks. They then prescribe the initialization and per-layer learning-rate rules that cancel those scale-dependent terms. Crucially, they do this for SGD and Adam separately, because the optimizer choice changes which terms in the dynamics dominate.
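To make "per-layer learning-rate rules" concrete, here is a minimal sketch of the kind of rule muP prescribes for dense networks: under Adam, the commonly cited muP scaling shrinks the learning rate of hidden weight matrices in proportion to 1/width, with an initialization standard deviation of 1/sqrt(fan_in). The function names and the base width of 256 are assumptions made for illustration. MSSP's MoE-specific rules are not spelled out in this explainer, so they are deliberately not reproduced here.

```python
import math


def mup_adam_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Dense muP rule of thumb for Adam: hidden-matrix LR scales as 1/width.

    base_lr is the value tuned at base_width; both names are illustrative.
    """
    return base_lr * base_width / width


def mup_hidden_init_std(fan_in: int) -> float:
    """Hidden-matrix init standard deviation, 1/sqrt(fan_in)."""
    return 1.0 / math.sqrt(fan_in)


if __name__ == "__main__":
    # Tune once at width 256, then reuse the same base_lr at larger widths.
    for width in (256, 1024, 4096):
        lr = mup_adam_hidden_lr(base_lr=1e-3, base_width=256, width=width)
        std = mup_hidden_init_std(fan_in=width)
        print(f"width={width:5d}  hidden_lr={lr:.2e}  init_std={std:.4f}")
```

The point of the sketch is that the tuned hyperparameter (`base_lr`) stays fixed while the per-layer value is rescaled by a known function of width. According to the paper, an MoE needs additional rules of this kind to account for expert aggregation.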

The headline claim is that MSSP preserves learning-rate transfer across MoE scaling regimes. Imagine you sweep the learning rate at a small scale, say a few-billion-parameter MoE, and find that around 1e-3 is best. Under muP at MoE scale, the optimum has moved by the time you train a 64×-larger MoE, so the small sweep no longer predicts the large one. Under MSSP, the same 1e-3 is still near-optimal at 64× scale. (These numbers are illustrative: the paper does not publish curves at this specific point, and the structural property is the claim.) The paper validates this across three scaling regimes, meaning three different rules for how width, expert count, and tokens grow together, and reports that MSSP holds in all three while muP holds in only one.
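One way to check learning-rate transfer in your own experiments is to compare where the best learning rate lands at each scale. The sketch below is a hypothetical diagnostic, not something taken from the paper: the helper names are invented for illustration, and the loss values are made up purely to show a drifting case and a stable case.

```python
import math


def best_lr(sweep: dict[float, float]) -> float:
    """Return the learning rate with the lowest final loss in one sweep."""
    return min(sweep, key=sweep.get)


def lr_drift_log2(sweeps_by_scale: dict[int, dict[float, float]]) -> float:
    """Spread of the optimal LR across scales, measured in doublings (log2)."""
    optima = [math.log2(best_lr(sweep)) for sweep in sweeps_by_scale.values()]
    return max(optima) - min(optima)


# Made-up final losses, keyed by scale multiplier and then by learning rate.
drifting = {
    1: {3e-4: 2.10, 1e-3: 2.00, 3e-3: 2.15},
    64: {3e-4: 1.70, 1e-3: 1.85, 3e-3: 2.40},  # optimum moved to 3e-4
}
stable = {
    1: {3e-4: 2.10, 1e-3: 2.00, 3e-3: 2.15},
    64: {3e-4: 1.75, 1e-3: 1.68, 3e-3: 1.90},  # optimum stayed at 1e-3
}

if __name__ == "__main__":
    print("drifting:", lr_drift_log2(drifting))
    print("stable:  ", lr_drift_log2(stable))
```

A drift of zero means the small-scale sweep picked the same grid point that wins at the large scale. The resolution of this check is limited by the spacing of the learning-rate grid.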

How MSSP relates to other parameterizations

Parameterization | Target architecture | Optimizers analyzed | Scale-stability claim (setup-dependent, illustrative)
Standard parameterization (SP) | Dense MLP / Transformer | SGD, Adam | Optimal LR shifts with width; the sweep needs to be repeated
muP (Maximal Update Parameterization) | Dense MLP / Transformer | SGD, Adam | Width-stable for dense networks; holds in only one of the three MoE regimes
NTK parameterization | Infinite-width dense | SGD | Linearizes the network; feature learning is reduced
MSSP (this paper) | MoE Transformer | SGD, Adam (derived separately) | Stable across all three MoE scaling regimes the paper studies

The point of the row-by-row contrast is that MSSP is not a replacement for muP; it is the MoE-aware generalization. Dense networks are covered by both muP and MSSP, while MoE networks need MSSP because muP's width-only rules don't cover the aggregation dynamics. The paper specifically frames MSSP as "muP for MoE" and reports qualitatively distinct training dynamics from muP at the scales where the two would otherwise be expected to behave the same.

Goes deeper in: LLM Internals → Transformer Block → Feed-forward Network — the FFN is exactly where MoE replaces the dense block with a router-and-experts setup, which is where MSSP's aggregation analysis kicks in.

Frequently Asked Questions