MSSP (Maximally Scale-Stable Parameterization) is a set of initialization and learning-rate scaling rules for Mixture-of-Experts neural networks, introduced in the paper 'Maximally Scale-Stable Parameterization for Mixture-of-Experts via Dynamical Mean Field Theory' by Vankadara, Haas, Hayward, Bordt, and Breccia. The authors apply DMFT — a physics-derived toolkit for the infinite-width / infinite-expert limit — to MoE training dynamics, identify the scale-dependent terms in the expert-aggregation step that muP doesn't cover, and prescribe new init and per-layer learning-rate rules that cancel them. They derive MSSP separately for SGD and Adam and report that it preserves learning-rate transfer across three different MoE scaling regimes, while muP holds in only one of them.

Why doesn't muP transfer to MoE?

muP (Maximal Update Parameterization) was derived for dense networks, where the scale-dependent terms in the training dynamics come from each layer's width — the matrix-multiply step. MoE adds an extra step on top: the router-weighted aggregation of expert outputs has its own scale-dependent observables, with magnitudes that depend on width and expert count jointly. The MSSP paper's DMFT analysis reads off exactly what those terms are. muP's width-only rules don't cancel them, so when you scale a dense and an MoE model up by the same factor, the dense one's optimal learning rate stays put while the MoE one drifts. MSSP's prescription is what brings the MoE case back to muP-style scale stability.
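
To see why an aggregation step can bring its own scale dependence, here is a toy sketch. It is not the paper's DMFT analysis: it assumes each expert's output is an independent standard-normal scalar and sums them with no normalization, and the function name aggregate_std is made up for this example. Under those assumptions the spread of the sum grows like the square root of the expert count, which is the kind of expert-count-dependent magnitude a width-only rule has no reason to cancel.

```python
import math
import random


def aggregate_std(num_experts, trials=2000, seed=0):
    """Empirical std of an unnormalized sum of i.i.d. N(0, 1) expert outputs.

    Toy assumption: experts are independent scalars with unit variance.
    Real MoE layers use router gates and vector outputs.
    """
    rng = random.Random(seed)
    sums = [
        sum(rng.gauss(0.0, 1.0) for _ in range(num_experts))
        for _ in range(trials)
    ]
    mean = sum(sums) / trials
    var = sum((s - mean) ** 2 for s in sums) / (trials - 1)
    return math.sqrt(var)
```

Calling aggregate_std(4), aggregate_std(16), and aggregate_std(64) should give values close to 2, 4, and 8, matching the sqrt(E) growth for a sum of E independent unit-variance terms.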

How does MSSP relate to muP and DMFT?

muP is the parameterization MSSP generalizes; DMFT is the analytic toolkit it is derived with. Dynamical Mean Field Theory is a physics framework for the infinite-width / infinite-particle limit of stochastic systems, which the authors port to MoE training — it gives a closed-form description of how aggregated quantities evolve and lets them isolate which terms are scale-dependent. The MSSP rules are the prescription that cancels those terms; muP is the special case where there are no specialist expert sub-blocks in the model, so the aggregation step disappears. Practically, MSSP positions itself as 'muP for MoE': the recipe you'd reach for when you want a small-scale learning-rate sweep to predict the large-scale optimum on an MoE model.

MSSP paper — Scale-stable parameterization beyond muP

MSSP — Scale-stable parameterization for MoE

TL;DR

What is it: The MSSP paper introduces Maximally Scale-Stable Parameterization — initialization and learning-rate scaling rules for Mixture-of-Experts training that, like muP for dense networks, are designed to keep the optimal learning rate stable as the model grows.
Why it’s needed: Scaling up an MoE model is expensive enough that you really don't want to re-sweep the learning rate every time you change width, expert count, or token budget. Learning-rate transfer is the property that makes the small-scale tuning sweep predict the large-scale one — without it, every scale-up is a re-tuning project.
vs previous: Plain muP was derived for dense networks; the MSSP authors show it doesn't fully transfer to MoE because the expert-aggregation dynamics introduce scale-dependent terms muP doesn't account for. MSSP is derived from a Dynamical Mean Field Theory analysis of MoE and keeps the LR transfer property across three different MoE scaling regimes.

Jargon

muP: Maximal Update Parameterization. A set of init and per-layer learning-rate rules for dense neural networks chosen so that, in the infinite-width limit, the optimal learning rate stops shifting as you make the model wider. Practically: tune at small width, transfer the recipe to large width without re-tuning. MSSP positions itself as muP's MoE-aware successor.
MoE: Mixture-of-Experts. A layer where, instead of one large feed-forward block, you have many smaller "expert" blocks and a router that sends each token to a small subset. Inference and training compute scale with the active experts, not the total parameter count.
DMFT: Dynamical Mean Field Theory. A physics-derived analytic toolkit for the infinite-width / infinite-particle limit of stochastic systems. Applied to training dynamics, it gives closed-form descriptions of how aggregated quantities (like the per-layer activation norm) evolve step by step. MSSP uses DMFT on the MoE aggregation to read off the scale-dependent terms.
Learning-rate transfer: The property that an optimal learning rate found at small scale stays optimal at larger scale, so the small-scale sweep predicts the big-scale one. muP is designed to provide this for width in dense networks; MSSP's claim is that it holds for width and expert count in MoE.
Scaling regime: A choice of how you grow width N, expert count E, and token budget T together as you scale up the model. The MSSP paper studies three different MoE scaling regimes and reports that learning-rate transfer holds under MSSP in all three, while under muP it holds in only one.
Aggregation dynamics: How the router-weighted sum of expert outputs behaves under stochastic training. The paper's central technical result is that scale-dependent observables in the aggregation dynamics are what cause muP's width-only rules to break down at MoE scale.
SGD / Adam: The two standard optimizers MSSP is derived for. SGD updates weights by the raw gradient; Adam normalizes per-parameter by a running estimate of gradient magnitude. The paper presents the MSSP rules separately for each, because the optimizer changes which terms in the dynamics dominate at scale.
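
The difference between the two optimizers in the SGD / Adam entry can be sketched in a few lines. This is the standard textbook form of each update on a single scalar weight, not the MSSP rules themselves; the function names are made up for this example, and the Adam version covers only the first step from zero-initialized moments.

```python
import math


def sgd_step(w, g, lr):
    """Plain SGD: the step is proportional to the raw gradient g."""
    return w - lr * g


def adam_first_step(w, g, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step from zero-initialized moments, with bias correction."""
    m = (1 - beta1) * g            # first-moment estimate after one step
    v = (1 - beta2) * g * g        # second-moment estimate after one step
    m_hat = m / (1 - beta1)        # bias correction recovers roughly g
    v_hat = v / (1 - beta2)        # bias correction recovers roughly g**2
    # The gradient is divided by its own magnitude, so the step size is
    # close to lr regardless of how large or small g is.
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)
```

Scaling the gradient by 1000 scales the SGD step by 1000 but leaves the Adam step almost unchanged. That per-parameter normalization is one reason a scaling analysis can come out differently for the two optimizers.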

Maximally Scale-Stable Parameterization (MSSP) is the parameterization rule introduced in Maximally Scale-Stable Parameterization for Mixture-of-Experts via Dynamical Mean Field Theory. The paper's diagnostic finding — that muP's scale-stability guarantee leaks at MoE scale because the expert-aggregation step introduces scale-dependent observables muP doesn't cover — is what motivates the extension.

The news. On May 16, 2026, Vankadara, Haas, Hayward, Bordt, and Breccia posted Maximally Scale-Stable Parameterization for Mixture-of-Experts via Dynamical Mean Field Theory. Using DMFT in the infinite-width and infinite-expert limit, they derive scale-stable init and learning-rate rules, which they call MSSP, for both SGD and Adam across three different MoE scaling regimes. They report that MSSP produces qualitatively distinct training dynamics from muP and maintains monotonic improvement as model dimensions grow proportionally.

Picture an oven with a single temperature dial. For years, a tested recipe-scaling rule has told you that doubling, quadrupling, even 10×-ing the portion of one main dish needs no temperature change — bake it the same way. That rule is the kitchen's version of muP, the Maximal Update Parameterization that, in dense neural networks, lets you tune the learning rate at small width and trust it at large width. The MSSP paper's setup is that you now want to share that oven with three or four specialist side dishes — sauces, pastries, baked add-ons — that get added or grown as you scale up. Suddenly the dial that "just worked" for the main dish is wrong: the side dishes draw heat differently, and the right temperature drifts.

Translating back: the "side dishes" are MoE experts, and "drawing heat differently" is the expert-aggregation step the router does at every MoE layer. The MSSP authors apply Dynamical Mean Field Theory, a physics toolkit for the infinite-width and infinite-expert limit, to read off exactly which terms in the training dynamics depend on scale in MoE but not in dense networks. They then prescribe the init and per-layer learning-rate rules that cancel those scale-dependent terms. Crucially, they do this for SGD and Adam separately, because the optimizer choice changes which terms in the dynamics dominate.
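
For readers who want the aggregation step in concrete form, here is a minimal sketch of one token passing through a top-k MoE layer. It assumes one common convention (pick the top-k router logits, then softmax over just those); the paper may analyze a different router, and the names moe_forward, router_weights, and experts are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_weights, experts, top_k=2):
    """Route one token x (a list of floats) to its top_k experts and
    return the gate-weighted sum of their outputs."""
    # Router: one logit per expert, the dot product of x with that expert's row.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    # Keep the top_k experts by logit and renormalize their gates.
    order = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    gates = softmax([logits[i] for i in chosen])
    # Aggregation: the router-weighted sum of the chosen experts' outputs.
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)
        out = [o + gate * y_j for o, y_j in zip(out, y)]
    return out, chosen, gates
```

The last loop is the expert-aggregation step: its output magnitude depends on how many experts contribute and how the gates are distributed, not only on the width of each expert.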

The headline claim is that MSSP preserves learning-rate transfer across MoE scaling regimes. Imagine you sweep the learning rate at a small scale, say a few-billion-parameter MoE, and find that around 1e-3 is best. Under muP at MoE scale, the optimum has moved by the time you train your 64×-larger MoE, so the small sweep no longer predicts the large one. Under MSSP, the same 1e-3 is still near-optimal at 64× scale. (These numbers are illustrative: the paper does not publish curves at this specific point, and the structural property is the claim.) The paper validates this across three different scaling regimes, meaning different rules for how width, expert count, and tokens grow together, and reports that MSSP holds in all three while muP holds in only one.
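
One way to make "the optimum has moved" measurable is to compare the best learning rate from a sweep at each scale. The sketch below does that on a log2 scale. The helper names and every loss value in it are made up for illustration; none of the numbers come from the paper.

```python
import math


def best_lr(sweep):
    """Return the learning rate with the lowest final loss in a {lr: loss} sweep."""
    return min(sweep, key=sweep.get)


def optimum_shift_log2(small_sweep, large_sweep):
    """Shift of the optimal LR between two scales, in factors of two.

    0.0 means the small-scale optimum transferred exactly on this grid.
    """
    return math.log2(best_lr(large_sweep) / best_lr(small_sweep))


# Made-up losses for illustration only; these are not results from the paper.
small = {1e-4: 3.10, 3e-4: 2.95, 1e-3: 2.90, 3e-3: 3.05}
large_drifted = {1e-4: 2.70, 3e-4: 2.60, 1e-3: 2.75, 3e-3: 3.20}
large_stable = {1e-4: 2.72, 3e-4: 2.61, 1e-3: 2.55, 3e-3: 2.80}
```

With these toy sweeps, optimum_shift_log2(small, large_stable) is 0.0, while optimum_shift_log2(small, large_drifted) is negative because the best learning rate at the larger scale dropped from 1e-3 to 3e-4. A real check would use a finer grid and several seeds.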

How MSSP relates to other parameterizations

Parameterization	Target architecture	Optimizers analyzed	Scale-stability claim (setup-dependent, illustrative)
Standard parameterization (SP)	Dense MLP / Transformer	SGD, Adam	Optimal LR shifts with width — sweep needs to be repeated
muP (Maximal Update Parameterization)	Dense MLP / Transformer	SGD, Adam	Width-stable for dense networks; stable in only one of the three MoE regimes studied
NTK parameterization	Infinite-width dense	SGD	Linearizes the network — feature learning is reduced
MSSP (this paper)	MoE Transformer	SGD, Adam (derived separately)	Stable across all three MoE scaling regimes the paper studies

The point of the row-by-row contrast is that MSSP is not a replacement for muP — it's the MoE-aware generalization. Dense networks are covered by both muP and MSSP; MoE networks need MSSP because muP's width-only rules don't cover the aggregation dynamics. The paper specifically frames MSSP as "muP for MoE" and reports qualitatively distinct training dynamics from muP at the scales where the two would otherwise be expected to behave the same.

Goes deeper in: LLM Internals → Transformer Block → Feed-forward Network — the FFN is exactly where MoE replaces the dense block with a router-and-experts setup, which is where MSSP's aggregation analysis kicks in.

Related explainers

IBM Granite 4.1 — Dense vs MoE families — the architectural contrast MSSP is parameterizing
NVIDIA Nemotron 3 — Multimodal MoE — a recent MoE model where this kind of scaling discipline matters
ZeDA paper — Zero-output expert distillation — a different angle on what can go subtly wrong with MoE experts at scale
