What is differentiable soft top-k routing?

It is the routing scheme in the SoftMoE paper (ICML 2026) for Mixture-of-Experts models. Instead of a hard top-k that turns a few experts fully on and the rest off, SoftMoE gives every expert a continuous weight using a smooth relaxation called LapSum, while still concentrating most weight on a few experts. Because the routing function now has a slope, gradients from the task loss flow through it and the router can be trained directly, under a budget that controls the average number of experts per token.

How is SoftMoE different from standard MoE routing?

Standard MoE routing is a hard top-k: an expert is fully in or fully out, and the model needs auxiliary load-balancing losses to keep the router from collapsing. SoftMoE keeps the same experts and roughly the same compute budget but makes the routing decision differentiable — every expert gets a continuous, concentrated weight via the LapSum relaxation. A global budget constraint becomes a learnable allocation, and the paper finds the model learns to use more experts in its later layers than its earlier ones.

SoftMoE replaces top-k expert routing — Differentiable soft top-k routing

Why can't a hard top-k router be trained directly?

Because picking the top-k experts is an argmax — a hard, all-or-nothing choice. Raising a near-miss expert's score a little does not change its weight (it stays at zero) until it abruptly crosses the cutoff, so the derivative is zero almost everywhere and undefined at the jump. With no gradient of its own, the router has to be trained indirectly through hand-tuned auxiliary load-balancing losses. SoftMoE's smooth soft top-k restores the slope, so the router learns from the loss directly.

TL;DR

What is it: SoftMoE is an ICML 2026 paper that replaces a mixture-of-experts model's hard top-k router with a differentiable soft top-k, so the routing decision can be trained by gradient descent like the rest of the network.
Why it’s needed: The router that decides which experts run is the control knob of a sparse MoE, but a hard pick gives that router no gradient of its own. Models prop it up with hand-tuned auxiliary load-balancing losses; making routing differentiable lets it learn straight from the task loss instead.
vs previous: Against standard top-k routing — a hard argmax with zero gradient almost everywhere, propped up by extra balancing losses — SoftMoE uses a smooth LapSum relaxation that still concentrates weight on a few experts but is differentiable end to end, plus a budget that lets the model learn how many experts to use per layer.

Jargon

MoE (Mixture of Experts): A layer whose feed-forward block is split into many expert sub-networks. A small router activates only a few experts per token, so the model holds huge total capacity but runs only a slice of it each step.
Router: The tiny network that scores every expert for a token and decides which ones run. It is the part SoftMoE redesigns; the experts themselves are unchanged.
Top-k routing: The standard scheme: score all experts, keep the top-k (a small fixed number), run only those. It is a hard choice — an expert is either fully in or fully out, like greedy decoding's argmax.
Gradient / differentiable: Training nudges every weight in the direction that lowers the loss, which requires a slope (a gradient). A function that is flat everywhere except for a sudden jump (a hard pick) has no usable slope, so you can't train through it directly.
Soft top-k (LapSum): SoftMoE's replacement: a smooth function (built on a relaxation the authors call LapSum) that still concentrates weight on a few experts but gives every expert a continuous weight — so it has a slope and gradients can flow through it.
Auxiliary load-balancing loss: An extra, hand-designed training objective bolted onto MoEs to stop the router collapsing onto a few experts. It exists because the hard router has no gradient of its own; soft routing reduces the need for it.
Experts-per-token budget: A global constraint on the average number of experts each token activates. SoftMoE makes it a learnable allocation — the model decides per layer, and the paper finds later layers learn to use more.

The news. On June 16, 2026, researchers posted SoftMoE, accepted at ICML 2026. It targets a quiet awkwardness in standard sparse Mixture-of-Experts models: the router picks experts with a hard top-k, which has no gradient, so the router can only be trained indirectly through extra balancing losses. SoftMoE replaces that hard pick with a truncated soft top-k built on a smooth relaxation (LapSum), making routing differentiable end to end. A global budget then lets the model learn how many experts to activate per layer — and the learned allocation comes out non-uniform, with later layers using more experts than earlier ones.

Picture the wall of a control room with a light for every expert. The router's job is to turn on a few lights for each token and leave the rest dark. Standard Mixture-of-Experts wires those lights to on/off switches: the top few experts flip fully on, every other expert is exactly off. It works — but now imagine you are trying to tune that wall to light the room better. With on/off switches there is no in-between: you flip a switch and the light is either there or not. You can't feel whether nudging one switch a hair would have helped, because a switch has no "a little." That missing "a little" is the whole problem SoftMoE attacks.

In machine-learning terms, the "a little" is a gradient. Training works by nudging every weight in the direction that lowers the loss, and that needs a slope to follow. A hard top-k pick is an argmax — the same all-or-nothing choice as greedy decoding — and an argmax is flat almost everywhere with a sudden jump at the tie. So the router gets no gradient from its own decision: raising a near-miss expert's score by a hair changes nothing until it abruptly crosses the cutoff and snaps in. Because the router can't learn from the task loss directly, standard sparse MoEs add auxiliary load-balancing losses — extra, hand-tuned objectives whose only job is to stop the router from collapsing onto a handful of experts. The expert math is fine; it is the choosing that won't train.
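The flat-then-jump behaviour is easy to check numerically. The sketch below assumes the simplified picture used in this explainer, where kept experts get weight 1.0 and all others 0.0; the function names and the eight scores are hypothetical, chosen only for illustration.

```python
def hard_top_k_weights(scores, k):
    """Return 1.0 for the k highest-scoring experts and 0.0 for the rest."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def finite_difference(scores, k, expert, eps):
    """Estimate d(weight of `expert`) / d(score of `expert`) with a forward difference."""
    bumped = list(scores)
    bumped[expert] += eps
    before = hard_top_k_weights(scores, k)[expert]
    after = hard_top_k_weights(bumped, k)[expert]
    return (after - before) / eps


# Hypothetical router scores for 8 experts; index 4 is the near miss (rank 5).
scores = [2.0, 1.8, 1.5, 1.2, 1.1, 0.3, 0.1, -0.4]
k = 4

slope_small = finite_difference(scores, k, expert=4, eps=0.01)  # still rank 5
slope_jump = finite_difference(scores, k, expert=4, eps=0.2)    # crosses the cutoff
print(slope_small, slope_jump)
```

With the small bump the near-miss expert stays at rank 5, so the estimated slope is exactly zero. With the large bump it crosses the cutoff, and the estimate is just the jump height divided by the step size: an artifact of the step chosen, not a usable gradient.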

SoftMoE swaps the on/off switches for dimmer dials. Its soft top-k, built on a smooth relaxation the authors call LapSum, still concentrates most of the weight on a few experts, but every expert now gets a continuous weight instead of a 0-or-1 verdict. A dial has a slope: turn it slightly and the light changes slightly, so you can finally feel which way to adjust. Concretely, the routing function becomes differentiable, so the gradient of the task loss flows straight through the routing decision and tells the router "lean a bit more on expert 12 here." A global budget caps the total wattage the dials may draw: the weights must add up to roughly the target number of active experts, which is what keeps the layer sparse. Within that budget, the model learns its own allocation rather than obeying a hand-set rule.
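The dimmer-dial picture can be written down compactly. The block below is a sketch under assumptions, not the paper's formulation: the symbols $s_i$ (router score of expert $i$), $w_i$ (its routing weight), $N$ (number of experts), $k$ (the experts-per-token budget), $\sigma$ (any smooth, increasing step from 0 to 1), $\tau$ (a shared threshold) and $T$ (a temperature) are all introduced here for illustration, and the source does not spell out LapSum's exact formula.

```latex
% Budgeted soft top-k: every expert gets a weight in (0, 1), and the weights sum to k.
w_i = \sigma\!\left(\frac{s_i - \tau}{T}\right),
\qquad
\sum_{i=1}^{N} w_i = k .

% Differentiating the budget constraint implicitly gives the threshold's sensitivity,
% writing \sigma'_i := \sigma'\!\big((s_i - \tau)/T\big):
\frac{\partial \tau}{\partial s_j} = \frac{\sigma'_j}{\sum_{m=1}^{N} \sigma'_m},
\qquad
\frac{\partial w_i}{\partial s_j}
  = \frac{\sigma'_i}{T}\left(\delta_{ij} - \frac{\sigma'_j}{\sum_{m=1}^{N} \sigma'_m}\right).
```

Under these assumptions, $\partial w_i / \partial s_i$ is strictly positive whenever $\sigma'_i > 0$ and at least one other expert also has a nonzero $\sigma'$, so nudging a score always moves its weight. Summing the second expression over $i$ gives zero: weight gained by one expert is taken from the others, which is the budget at work.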

Where the gradient comes from

Take one layer with 64 experts and a budget of about 4 active per token (illustrative — the paper fixes neither). Under hard top-k the router scores all 64 and keeps ranks 1–4 at full weight; ranks 5–64 get exactly 0. Ask the trainer the key question — if I raise expert #5's score by a tiny ε, how much does its weight change? The answer is 0: #5 is still rank 5, still weight 0, right up until it crosses #4 and snaps from 0 to full. The derivative is zero almost everywhere and undefined at the jump — no gradient, so the router can't be taught what to prefer. Now switch to soft top-k with the same budget. Every expert gets a smooth weight summing to ~4 — say [0.97, 0.92, 0.85, 0.55, 0.4, 0.18, 0.08, …] over the contenders. Raise #5's score by ε and its weight rises smoothly, so ∂(loss)/∂(router score) is now nonzero and points somewhere. The layer still spends ~4 experts of compute — the budget is unchanged — but the routing decision finally has a slope to train on. That is the entire trade: same sparsity, a gradient where there used to be none.
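The same question can be asked of a budgeted soft top-k. The sketch below is a stand-in, not the paper's LapSum: it assumes each weight is a Laplace CDF of the score's distance from a shared threshold, and uses bisection to find the threshold at which the weights sum to the budget. `laplace_cdf`, `soft_top_k_weights`, the `temperature` value and the eight scores are all hypothetical choices for illustration.

```python
import math


def laplace_cdf(x):
    """CDF of a unit-scale Laplace distribution: a smooth step from 0 to 1."""
    return 0.5 * math.exp(x) if x < 0 else 1.0 - 0.5 * math.exp(-x)


def soft_top_k_weights(scores, k, temperature=0.25, iters=60):
    """Give every expert a weight in (0, 1) such that the weights sum to k.

    Requires 0 < k < len(scores). Bisection finds the shared threshold tau.
    """
    def total(tau):
        return sum(laplace_cdf((s - tau) / temperature) for s in scores)

    lo = min(scores) - 50.0 * temperature  # total(lo) is close to len(scores)
    hi = max(scores) + 50.0 * temperature  # total(hi) is close to 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > k:
            lo = mid  # too much weight handed out: raise the threshold
        else:
            hi = mid  # too little weight handed out: lower the threshold
    tau = 0.5 * (lo + hi)
    return [laplace_cdf((s - tau) / temperature) for s in scores]


# Hypothetical router scores for 8 experts; index 4 is the near miss (rank 5).
scores = [2.0, 1.8, 1.5, 1.2, 1.1, 0.3, 0.1, -0.4]
k = 4
weights = soft_top_k_weights(scores, k)

# Forward-difference estimate of d(weight) / d(score) for the near-miss expert.
eps = 1e-3
bumped = list(scores)
bumped[4] += eps
slope = (soft_top_k_weights(bumped, k)[4] - weights[4]) / eps

print([round(w, 3) for w in weights], round(sum(weights), 6), slope)
```

The weights still sum to the budget, so the layer's nominal compute is unchanged, but the finite-difference slope for the near-miss expert is now positive instead of zero.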

Routing scheme	An expert's weight	Gradient through routing	How the router is trained
Hard top-k (standard MoE)	0 or full (a hard cutoff)	Zero almost everywhere — an argmax	Indirectly, via hand-tuned auxiliary load-balancing losses
Soft top-k (SoftMoE)	Continuous, concentrated on a few (LapSum)	Nonzero — differentiable end to end	Directly from the task loss, under a learned budget (ICML 2026)

The deeper pattern is one worth carrying: a discrete choice can't be trained by gradient descent, but a smooth relaxation of it can. It is the same move behind Gumbel-softmax, and a close relative of straight-through estimators (which keep the hard choice in the forward pass and substitute a smooth surrogate gradient in the backward pass), applied here to expert routing. Once the choice is differentiable, the budget stops being a fixed rule and becomes something the network optimizes, which is why SoftMoE's learned allocation is non-uniform, with later layers choosing to spread across more experts than early ones. The paper reports that this matches or beats sparse MoE with fewer active experts. As an ICML 2026 result on the authors' own models, though, the honest reading is "a cleaner, trainable router," not yet a settled production default: the win depends on how much your model was paying for that hand-tuned balancing in the first place.


Related explainers

dMoE — block-level expert routing — a different angle on the router: cutting how many distinct experts a decoded block has to load
ZEDA — zero-output expert self-distillation — teaches the router to skip experts entirely, another way to bend the routing decision
Manifold Power Iteration — router-to-expert alignment — redesigns the MoE router from the alignment side rather than the differentiability side


