The news. On June 11, 2026, a paper titled Manifold Power Iteration (MPI) proposed a redesign of the router inside a Mixture-of-Experts model. Instead of letting each router row be learned freely, MPI rotates every row to align with the principal singular direction of its expert's weight matrix, so the router "points at" what its expert actually computes. Because it only reshapes the router's existing weights, it costs about 0.2% extra training time and zero inference overhead, and the authors report gains across 1B, 3B, and 11B-parameter models.

Picture a theater lighting crew. Each actor on stage has one spot where they perform best, and a row of spotlights overhead decides which actor gets lit for every cue. If a spotlight is aimed by guesswork it lands between actors — the cue goes to nobody useful — and one over-bright lamp washes out the rest. Manifold Power Iteration is the crew swivelling every spotlight until it locks onto its actor's best spot, then trimming them all to the same brightness so none drowns the others. In a Mixture-of-Experts layer, the actors are the experts, the spotlights are the router's rows, and "the actor's best spot" is the kind of input each expert is built to handle.

Underneath the metaphor, a MoE layer replaces the single dense feed-forward network with many expert FFNs plus a small router — a matrix with one row per expert. For each token the router scores every row and the top few experts fire. The catch the paper names is that those router rows are learned from scratch, so a row can drift to point nowhere near what its expert is actually good at — tokens land on the wrong specialist, capacity is wasted, and the load across experts skews.
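To make the routing step concrete, here is a minimal sketch of how a router might score one token and pick its top experts. The function name `route_token`, the toy 3-dimensional vectors, and the softmax over the selected scores are all illustrative assumptions; the source does not specify the gating function or the value of k.

```python
import math

def route_token(token, router_rows, top_k=2):
    """Score a token against every router row and pick the top_k experts."""
    # One dot product per expert: the router is a matrix with one row per expert.
    scores = [sum(r * x for r, x in zip(row, token)) for row in router_rows]
    # Indices of the highest-scoring experts, best first.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: mixing weights come from a softmax over the chosen scores.
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]
    return chosen, weights

# Toy router with four experts over a 3-dimensional hidden state (hypothetical values).
router_rows = [
    [1.0, 0.0, 0.0],  # expert 0
    [0.0, 1.0, 0.0],  # expert 1
    [0.0, 0.0, 1.0],  # expert 2
    [0.5, 0.5, 0.0],  # expert 3
]
token = [0.9, 0.1, 0.0]
chosen, weights = route_token(token, router_rows, top_k=2)
print(chosen, weights)
```

In this toy case the token lands on experts 0 and 3, because their rows have the largest dot product with it. If a row drifts away from what its expert handles well, the same arithmetic quietly sends tokens to the wrong specialist.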

Every expert, though, already advertises its specialty. A weight matrix has a dominant direction — the direction in input space it amplifies most, named by its top singular vector. MPI's one idea is that a router row should simply point along that direction. Its power-then-retract step does exactly that: a single power-iteration update rotates each router row toward its expert's top singular direction, then an L2 retraction renormalizes the row to a fixed length so its magnitude can't blow up. The authors frame this as steepest-ascent on the router–expert match under a minimal-change constraint — and because it only reshapes rows the router already had, it adds no parameters and no inference cost.
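The power-then-retract step can be sketched with textbook power iteration. This is a hedged illustration under stated assumptions, not the paper's exact update: it assumes the expert computes `y = W x`, so the direction it amplifies most is the top right singular vector of `W`, and that one power step is `r <- W^T (W r)`. The helper names and the `target_norm` parameter are hypothetical.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def power_then_retract(router_row, expert_weight, target_norm=1.0):
    """One power-iteration step toward the top singular direction, then L2 renormalize."""
    # Power step: r <- W^T (W r), which amplifies the component of r
    # lying along the top right singular vector of W.
    w_r = matvec(expert_weight, router_row)
    stepped = matvec(transpose(expert_weight), w_r)
    # Retraction: rescale to a fixed length so the row's magnitude cannot blow up.
    norm = l2_norm(stepped)
    if norm == 0.0:
        return list(router_row)  # degenerate case: leave the row unchanged
    return [target_norm * s / norm for s in stepped]

# Toy expert whose dominant input direction is the first axis (hypothetical values).
expert_weight = [[3.0, 0.0], [0.0, 1.0]]
router_row = [0.6, 0.8]  # unit length, mostly pointing along the weaker axis
aligned = power_then_retract(router_row, expert_weight)
print(aligned, l2_norm(aligned))
```

After a single step the row's component along the dominant axis rises from 0.6 to roughly 0.99 while its length stays at 1. The step changes only the values in an existing row, which is consistent with the claim that the method adds no parameters.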

| Setting the router's rows | How a row is chosen | Extra inference cost | What it tracks |
| --- | --- | --- | --- |
| Vanilla learned router | learned freely by gradient descent, like any other weight | none (baseline) | the routing loss only |
| MPI-aligned router (paper) | rotated onto its expert's top singular direction, then renormalized | none (same matmul shape) | what the expert computes, plus a more even load |

Where it earns its keep

The win is the price. Take the paper's largest test: an 11-billion-parameter model with 256 experts, 470 million of whose parameters fire per token, pretrained on 350 billion tokens at a baseline of 34.97 billion tokens/day — about 10 days on the cluster. MPI slows training by 0.2%, and two-tenths of a percent of a ~10-day run is roughly 29 minutes. At inference the router is the same 256-row matmul it always was — MPI only rotated rows it already had — so the per-token cost is unchanged: 470M activated, zero overhead. You pay ~29 extra minutes once and bank an estimated ~0.013 lower pretraining loss that the paper reports holds from 1B all the way up to 11B.
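The back-of-envelope arithmetic above can be checked in a few lines. The figures are the ones quoted in the text; the variable names are illustrative.

```python
# Figures quoted in the text for the 11B-parameter run.
total_tokens = 350e9       # pretraining tokens
tokens_per_day = 34.97e9   # baseline throughput
slowdown = 0.002           # 0.2% extra training time

baseline_days = total_tokens / tokens_per_day
extra_minutes = baseline_days * slowdown * 24 * 60

print(f"baseline: {baseline_days:.2f} days, extra: {extra_minutes:.1f} minutes")
# baseline: 10.01 days, extra: 28.8 minutes
```

The result rounds to the roughly 29 minutes cited above.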

A caveat worth stating plainly: alignment only helps when an expert's weight matrix has a clear dominant direction to point at. MPI sharpens where each token is routed; it does not add experts or widen them, so a layer whose experts are already well-separated has less to gain. What the rotation buys is a better starting geometry for the router — and the paper also reports improved MaxVio, the usual load-balancing metric, because rows trimmed to one scale spread tokens more evenly than a few over-eager ones.
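For readers unfamiliar with MaxVio, the sketch below uses the definition common in the load-balancing literature: the most loaded expert's excess over the perfectly balanced load, divided by that balanced load. The source does not spell the formula out, so treat this definition and the toy token counts as assumptions.

```python
def max_vio(expert_loads):
    """Maximal violation: (max load - mean load) / mean load. 0.0 means perfect balance."""
    mean_load = sum(expert_loads) / len(expert_loads)
    if mean_load == 0:
        return 0.0  # no tokens routed at all
    return (max(expert_loads) - mean_load) / mean_load

# Tokens received by each of four experts in one batch (hypothetical counts).
balanced = [25, 25, 25, 25]
skewed = [70, 10, 10, 10]

print(max_vio(balanced))  # 0.0
print(max_vio(skewed))    # 1.8
```

Lower is better: a value of 1.8 means the busiest expert receives 180% more tokens than it would under an even split.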


Frequently Asked Questions