The news. On June 16, 2026, researchers posted SoftMoE, accepted at ICML 2026. It targets a quiet awkwardness in standard sparse Mixture-of-Experts models: the router picks experts with a hard top-k, whose gradient is zero almost everywhere, so the router can only be trained indirectly through extra balancing losses. SoftMoE replaces that hard pick with a truncated soft top-k built on a smooth relaxation (LapSum), making routing differentiable end to end. A global budget then lets the model learn how many experts to activate per layer, and the learned allocation comes out non-uniform, with later layers using more experts than earlier ones.
Picture the wall of a control room with a light for every expert. The router's job is to turn on a few lights for each token and leave the rest dark. Standard Mixture-of-Experts wires those lights to on/off switches: the top few experts flip fully on, every other expert is exactly off. It works — but now imagine you are trying to tune that wall to light the room better. With on/off switches there is no in-between: you flip a switch and the light is either there or not. You can't feel whether nudging one switch a hair would have helped, because a switch has no "a little." That missing "a little" is the whole problem SoftMoE attacks.
In machine-learning terms, the "a little" is a gradient. Training works by nudging every weight in the direction that lowers the loss, and that needs a slope to follow. A hard top-k pick is an argmax — the same all-or-nothing choice as greedy decoding — and an argmax is flat almost everywhere with a sudden jump at the tie. So the router gets no gradient from its own decision: raising a near-miss expert's score by a hair changes nothing until it abruptly crosses the cutoff and snaps in. Because the router can't learn from the task loss directly, standard sparse MoEs add auxiliary load-balancing losses — extra, hand-tuned objectives whose only job is to stop the router from collapsing onto a handful of experts. The expert math is fine; it is the choosing that won't train.
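The flat-then-jump behaviour is easy to check numerically. The sketch below is a minimal illustration in plain Python, assuming the simplified 0-or-1 gate described above; the function names and the five router scores are hypothetical and not taken from the paper.

```python
def hard_topk_gate(scores, k):
    """Return 0/1 gates: 1.0 for the k highest scores, 0.0 for the rest."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def finite_difference(gate_fn, scores, index, eps):
    """Estimate d gate[index] / d score[index] with a one-sided nudge of size eps."""
    bumped = list(scores)
    bumped[index] += eps
    return (gate_fn(bumped)[index] - gate_fn(scores)[index]) / eps


# Hypothetical router scores; expert 3 (score 0.9) is the near miss for k=3.
scores = [2.0, 1.5, 1.0, 0.9, 0.5]
gate = lambda s: hard_topk_gate(s, k=3)

# A small nudge leaves expert 3 below the cutoff, so its gate does not move.
slope_small_nudge = finite_difference(gate, scores, index=3, eps=1e-3)

# Only once its score passes expert 2's does the gate snap from 0 to 1.
gate_after_crossing = gate([2.0, 1.5, 1.0, 1.1, 0.5])[3]

print(slope_small_nudge, gate_after_crossing)
```

The estimated slope stays at zero for any nudge too small to change the ranking, which is the sense in which the router "gets no gradient" from the selection itself.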
SoftMoE swaps the on/off switches for dimmer dials. Its soft top-k, built on a smooth relaxation the authors call LapSum, still concentrates most of the weight on a few experts, but every expert now gets a continuous weight instead of a 0-or-1 verdict. A dial has a slope: turn it slightly and the light changes slightly, so you can finally feel which way to adjust. Concretely, the routing function becomes differentiable, so the gradient of the task loss flows straight through the routing decision and tells the router "lean a bit more on expert 12 here." The dials behave like a softened distribution under a temperature, and a global budget caps the total wattage they may draw, which is what keeps the layer sparse. Within that budget, the model now learns its own allocation rather than obeying a hand-set rule.
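To see why a dial has a slope, it can help to write down one simple relaxation. The form below is a sigmoid-threshold stand-in chosen for illustration, not the paper's LapSum construction. The symbols are notation introduced here: $s_i$ is the router score of expert $i$, $E$ the number of experts, $T$ a temperature, $B$ the budget, and $\tau$ a shared threshold chosen so the weights sum to $B$.

$$
w_i = \sigma\!\left(\frac{s_i - \tau}{T}\right), \qquad \sum_{i=1}^{E} w_i = B
$$

Differentiating the budget constraint implicitly gives the sensitivity of each weight to each score:

$$
\frac{\partial w_i}{\partial s_j} = \frac{1}{T}\, g_i \left(\delta_{ij} - \frac{g_j}{\sum_{m=1}^{E} g_m}\right), \qquad g_i = w_i\,(1 - w_i)
$$

For $i = j$ this is strictly positive whenever some other expert also has a nonzero $g_m$, so raising an expert's score always raises its own weight a little. As $T \to 0$ the factors $g_i$ vanish for every expert away from the threshold, which recovers the flat gradient of the hard switch.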
Where the gradient comes from
Take one layer with 64 experts and a budget of about 4 active per token (illustrative — the paper fixes neither). Under hard top-k the router scores all 64 and keeps ranks 1–4 at full weight; ranks 5–64 get exactly 0. Ask the trainer the key question — if I raise expert #5's score by a tiny ε, how much does its weight change? The answer is 0: #5 is still rank 5, still weight 0, right up until it crosses #4 and snaps from 0 to full. The derivative is zero almost everywhere and undefined at the jump — no gradient, so the router can't be taught what to prefer. Now switch to soft top-k with the same budget. Every expert gets a smooth weight summing to ~4 — say [0.97, 0.92, 0.85, 0.55, 0.4, 0.18, 0.08, …] over the contenders. Raise #5's score by ε and its weight rises smoothly, so ∂(loss)/∂(router score) is now nonzero and points somewhere. The layer still spends ~4 experts of compute — the budget is unchanged — but the routing decision finally has a slope to train on. That is the entire trade: same sparsity, a gradient where there used to be none.
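The 64-expert, budget-of-4 example can be reproduced with a small stand-in relaxation. The sketch below is an assumption-laden illustration, not the paper's LapSum: it passes each score through a sigmoid around a shared threshold and finds that threshold by bisection so the weights sum to the budget. The function names, the temperature of 0.5, and the linearly spaced scores are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_topk(scores, budget, temperature=0.5, iters=100):
    """Weights in (0, 1) that sum to `budget`, via a shared threshold found by bisection."""
    def total(threshold):
        return sum(sigmoid((s - threshold) / temperature) for s in scores)

    # total() decreases as the threshold rises, so bracket the root and bisect.
    lo = min(scores) - 50.0 * temperature  # total(lo) is close to len(scores)
    hi = max(scores) + 50.0 * temperature  # total(hi) is close to 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > budget:
            lo = mid
        else:
            hi = mid
    threshold = 0.5 * (lo + hi)
    return [sigmoid((s - threshold) / temperature) for s in scores]


# Hypothetical scores for 64 experts, strongest first.
scores = [3.0 - 0.25 * i for i in range(64)]
weights = soft_topk(scores, budget=4.0)

# Nudge expert #5 (index 4) and measure how its weight responds.
eps = 1e-4
bumped = list(scores)
bumped[4] += eps
slope = (soft_topk(bumped, budget=4.0)[4] - weights[4]) / eps

print(f"sum of weights: {sum(weights):.4f}")
print(f"weight of expert #5: {weights[4]:.3f}, finite-difference slope: {slope:.3f}")
```

The weights still add up to the budget, so the layer's nominal compute is unchanged, but the finite-difference slope for expert #5 is now positive instead of zero. Lowering `temperature` sharpens the weights toward the hard cutoff and shrinks that slope.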
| Routing scheme | An expert's weight | Gradient through routing | How the router is trained |
|---|---|---|---|
| Hard top-k (standard MoE) | 0 or full (a hard cutoff) | Zero almost everywhere — an argmax | Indirectly, via hand-tuned auxiliary load-balancing losses |
| Soft top-k (SoftMoE) | Continuous, concentrated on a few (LapSum) | Nonzero — differentiable end to end | Directly from the task loss, under a learned budget (ICML 2026) |
The deeper pattern is one worth carrying: gradient descent can't train a discrete choice directly, but it can train a smooth relaxation of it. Gumbel-softmax makes the same move, and straight-through estimators are a close cousin that keep the hard choice in the forward pass and borrow a smooth gradient for the backward pass; SoftMoE applies the idea to expert routing. Once the choice is differentiable, the budget stops being a fixed rule and becomes something the network optimizes, which is why SoftMoE's learned allocation is non-uniform, with later layers choosing to spread across more experts than early ones. The paper reports that this matches or beats sparse MoE with fewer active experts. As an ICML 2026 result on the authors' own models, though, the honest reading is "a cleaner, trainable router," not yet a settled production default: the win depends on how much your model was paying for that hand-tuned balancing in the first place.
Related explainers
- dMoE — block-level expert routing — a different angle on the router: cutting how many distinct experts a decoded block has to load
- ZEDA — zero-output expert self-distillation — teaches the router to skip experts entirely, another way to bend the routing decision
- Manifold Power Iteration — router-to-expert alignment — redesigns the MoE router from the alignment side rather than the differentiability side