What is Manifold Power Iteration (MPI)?

MPI is a redesign of the router in a Mixture-of-Experts model (Hugging Face paper 2606.12397). Instead of learning each router row freely, it rotates every row to align with the principal singular direction of its expert's weight matrix, the input direction that expert amplifies most. It does this with a one-step 'power-then-retract' update that only reshapes the router's existing weights. The result is a router that points at what each expert actually computes, at about 0.2% extra training time and zero inference overhead. The authors evaluate it on 1B, 3B, and 11B-parameter models.

Why align a router row with its expert's singular direction?

A weight matrix's top singular vector is a compact description of the inputs that matrix responds to most strongly. If a router row points along that direction, the router tends to score highest for the tokens its expert is built to handle, so tokens reach the right specialist. Keeping every row at the same scale also stops a few over-eager rows from dominating, and the paper reports a more even load across experts (an improved MaxVio balancing metric). A freely learned row carries no such guarantee and can drift away from the expert's actual specialty.

How is MPI different from a normal MoE router?

A normal router learns its rows by gradient descent with no constraint tying them to the experts. MPI keeps the same router shape and the same inference cost but reshapes the router weights with a power-then-retract update: a power-iteration step rotates each row toward its expert's top singular direction, then an L2 retraction renormalizes it to a fixed scale so norms can't explode. The paper reports lower pretraining loss and improved MaxVio load balancing across 1B, 3B, and 11B models, with only ~0.2% extra training cost.

Manifold Power Iteration redesigns MoE routers — Router-to-expert alignment

Jargon

Mixture-of-Experts (MoE): A layer that swaps one big feed-forward network for many smaller "expert" networks plus a router. Only a few experts run per token, so the model holds far more parameters than it spends on any one token.
Router (gating network): The small matrix — one row per expert — that scores how well each expert fits the current token, then sends the token to its top few. The rows are normally learned like any other weight.
Expert: One of the many feed-forward sub-networks in a MoE layer. Each tends to specialize on a slice of inputs; MPI's goal is to make the router aware of that specialty.
Singular vector / principal direction: From the singular value decomposition of a matrix: the top singular vector is the input direction the matrix amplifies most — a compact description of "what this expert is best at."
Power iteration: A classic, cheap way to find a matrix's dominant direction by repeatedly multiplying a vector through it. MPI uses one such step to nudge a router row toward its expert's top singular direction.
L2 retraction: After the rotation, renormalizing the row back to a fixed length (scale C/√N). Without it, repeatedly pushing rows toward big directions would make their norms explode.
MaxVio (load balancing): A standard metric for how unevenly tokens spread across experts — lower is better. A router whose rows all sit at the same scale tends to balance the load instead of overloading a few experts.

The news. On June 11, 2026, a paper titled Manifold Power Iteration (MPI) proposed a redesign of the router inside a Mixture-of-Experts model. Instead of letting each router row be learned freely, MPI rotates every row to align with the principal singular direction of its expert's weight matrix, so the router "points at" what its expert actually computes. Because it only reshapes the router's existing weights, it costs about 0.2% extra training time and zero inference overhead, and the authors report gains across 1B, 3B, and 11B-parameter models.

Picture a theater lighting crew. Each actor on stage has one spot where they perform best, and a row of spotlights overhead decides which actor gets lit for every cue. If a spotlight is aimed by guesswork it lands between actors — the cue goes to nobody useful — and one over-bright lamp washes out the rest. Manifold Power Iteration is the crew swivelling every spotlight until it locks onto its actor's best spot, then trimming them all to the same brightness so none drowns the others. In a Mixture-of-Experts layer, the actors are the experts, the spotlights are the router's rows, and "the actor's best spot" is the kind of input each expert is built to handle.

Underneath the metaphor, a MoE layer replaces the single dense feed-forward network with many expert FFNs plus a small router — a matrix with one row per expert. For each token the router scores every row and the top few experts fire. The catch the paper names is that those router rows are learned from scratch, so a row can drift to point nowhere near what its expert is actually good at — tokens land on the wrong specialist, capacity is wasted, and the load across experts skews.
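The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes dot-product scoring and a top-2 selection, and the function name route_token and the toy numbers are hypothetical. Real routers usually also turn the scores into gate weights (for example with a softmax), which is left out here.

```python
def route_token(token, router_rows, top_k=2):
    """Score one token against every router row and pick the top_k experts.

    token:       list of floats, the token's hidden vector
    router_rows: list of rows, one per expert, each the same length as token
    Returns (indices of the chosen experts, all scores).
    """
    # One score per expert: the dot product of that expert's router row with the token.
    scores = [sum(r * t for r, t in zip(row, token)) for row in router_rows]
    # Rank expert indices from highest score to lowest and keep the first top_k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


# Toy example (hypothetical numbers): four experts, two-dimensional tokens.
router_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
token = [2.0, 0.5]
experts, scores = route_token(token, router_rows)
print(experts)  # [0, 3]
```

Only the direction and length of each row matter to this step, which is why a row that has drifted away from its expert's specialty sends tokens to the wrong place.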

Every expert, though, already advertises its specialty. A weight matrix has a dominant direction — the direction in input space it amplifies most, named by its top singular vector. MPI's one idea is that a router row should simply point along that direction. Its power-then-retract step does exactly that: a single power-iteration update rotates each router row toward its expert's top singular direction, then an L2 retraction renormalizes the row to a fixed length so its magnitude can't blow up. The authors frame this as steepest-ascent on the router–expert match under a minimal-change constraint — and because it only reshapes rows the router already had, it adds no parameters and no inference cost.
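A rough sketch of what a power-then-retract step could look like follows. It rests on stated assumptions rather than the paper's exact update: each expert is summarized by a single weight matrix W, the power step multiplies the row by W^T W, and the retraction rescales the row to a hypothetical target_norm (the source gives the scale as C/√N without defining the symbols here). All function names are illustrative.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


def power_then_retract(router_row, expert_weight, target_norm=1.0):
    """One power-iteration step, then an L2 retraction (assumed form).

    expert_weight has shape (d_out, d_in); router_row has length d_in.
    """
    # Power step: W^T (W r) stretches the row most along W's top singular direction.
    stretched = matvec(transpose(expert_weight), matvec(expert_weight, router_row))
    norm = l2_norm(stretched)
    if norm == 0.0:
        # The row lies in W's null space; there is no direction to rotate toward.
        return list(router_row)
    # Retraction: rescale to a fixed length so repeated steps cannot blow up the norm.
    return [target_norm * s / norm for s in stretched]


# Toy expert that amplifies the first input coordinate 3x and the second 1x,
# so its top singular direction is [1, 0].
expert_weight = [[3.0, 0.0], [0.0, 1.0]]
top_direction = [1.0, 0.0]

row = [1.0, 1.0]
before = cosine(row, top_direction)   # about 0.707
row = power_then_retract(row, expert_weight)
after = cosine(row, top_direction)    # about 0.994
```

In this toy case a single step moves the row from 45 degrees off the expert's dominant direction to nearly aligned with it, while the retraction holds its length at 1. The update touches only values the router already stores, so the router's shape and inference cost stay the same.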

Router	How a row is set	Extra inference cost	What it tracks
Vanilla learned router	learned freely by gradient descent, like any other weight	none (baseline)	the routing loss only
MPI-aligned router (paper)	rotated onto its expert's top singular direction, then renormalized	none — same matmul shape	what the expert computes, plus a more even load

Where it earns its keep

The win is the price. Take the paper's largest test: an 11-billion-parameter model with 256 experts, 470 million of whose parameters fire per token, pretrained on 350 billion tokens at a baseline of 34.97 billion tokens/day — about 10 days on the cluster. MPI slows training by 0.2%, and two-tenths of a percent of a ~10-day run is roughly 29 minutes. At inference the router is the same 256-row matmul it always was — MPI only rotated rows it already had — so the per-token cost is unchanged: 470M activated, zero overhead. You pay ~29 extra minutes once and bank an estimated ~0.013 lower pretraining loss that the paper reports holds from 1B all the way up to 11B.
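The arithmetic above can be checked directly. This sketch uses only the figures quoted in the paragraph and assumes that a 0.2% slowdown means 0.2% more wall-clock time on the same cluster; the variable names are illustrative.

```python
# Figures quoted in the text for the 11B-parameter run.
total_tokens = 350e9        # tokens in the pretraining run
tokens_per_day = 34.97e9    # baseline throughput
slowdown = 0.002            # 0.2% extra training time (assumed to be wall-clock)
total_params = 11e9
active_params = 470e6

baseline_days = total_tokens / tokens_per_day        # about 10.0 days
extra_minutes = baseline_days * slowdown * 24 * 60   # about 28.8 minutes
active_fraction = active_params / total_params       # about 0.043

print(f"baseline: {baseline_days:.2f} days")
print(f"extra training time: {extra_minutes:.1f} minutes")
print(f"parameters active per token: {active_fraction:.1%}")
```

The extra time rounds to the 29 minutes quoted, and roughly 4% of the model's parameters fire on any one token.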

A caveat worth stating plainly: alignment only helps when an expert's weight matrix has a clear dominant direction to point at. MPI sharpens where each token is routed; it does not add experts or widen them, so a layer whose experts are already well-separated has less to gain. What the rotation buys is a better starting geometry for the router — and the paper also reports improved MaxVio, the usual load-balancing metric, because rows trimmed to one scale spread tokens more evenly than a few over-eager ones.
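For readers who want a concrete handle on MaxVio, here is a small sketch. The source does not spell out the formula, so this assumes one common definition: the most heavily loaded expert's token count minus the mean load, divided by the mean load. The function name max_violation and the token counts are hypothetical.

```python
def max_violation(expert_loads):
    """Assumed MaxVio: (max load - mean load) / mean load. Lower is better.

    expert_loads: number of tokens routed to each expert in a batch.
    """
    mean_load = sum(expert_loads) / len(expert_loads)
    return (max(expert_loads) - mean_load) / mean_load


# 100 tokens spread over four experts (toy numbers).
balanced = max_violation([25, 25, 25, 25])  # 0.0, perfectly even
skewed = max_violation([70, 10, 10, 10])    # 1.8, one expert takes 70% of the tokens
```

A value of 0 means every expert receives the same share, and the value grows as one expert absorbs more of the batch.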


Related explainers

dMoE — block-level expert routing — another MoE-router redesign, but it changes routing granularity; MPI changes what each row points at
Chiaroscuro — spectral-entropy token routing — a different answer to "who decides where a token goes": route by a token's spectrum, not a learned row
SigmaScale — learned SVD scaling — the same singular-value machinery (the top directions of a weight matrix) used to compress a model instead of to route through one


