The news. On July 2, 2026, Microsoft published ELDR (Expert-Locality-aware Decode Routing) (arXiv 2607.00466), a router for prefill/decode-disaggregated MoE serving. Its claim: existing routers balance only compute load across decode workers and ignore which experts each request will touch, so identically-loaded workers can have very different latency. By predicting a request’s experts and routing for locality, ELDR cuts median time-per-output-token by 5.9–13.9% across three MoE models and two workloads, on deployments up to 40 GPUs.
Picture a busy coffee bar. Each drink needs a specific handful of syrups, and each barista can only keep a few syrups within reach — anything else means a walk to the back stockroom, and that walk is the whole delay. A naive manager just sends each order to whoever is least busy. It looks balanced, but the least-busy barista often lacks the right syrups, so they keep trudging to the back. Fair on paper, slow in the cup.
That is exactly what a mixture-of-experts model faces during decode. Each decode step must load the weights of every distinct expert the batch activates — the syrups its drinks need — and the more scattered those experts are, the more weight-loading each step pays. A router that only evens out compute load ignores the one thing that actually sets the delay: how many distinct experts a worker’s batch drags in.
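To make "same load, different cost" concrete, here is a minimal Python sketch that counts the distinct experts a decode batch activates. The expert IDs and batch contents are invented for illustration and do not come from the paper; each request is modeled simply as the set of experts it activates in one step.

```python
def distinct_experts(batch):
    """Return the set of experts a decode batch activates in one step."""
    touched = set()
    for request_experts in batch:
        touched |= set(request_experts)
    return touched


# Two workers with identical load: 4 requests each, 2 experts per request.
# (Hypothetical expert IDs, chosen only to show the contrast.)
scattered = [{0, 9}, {3, 14}, {5, 11}, {7, 12}]
clustered = [{0, 1}, {1, 2}, {0, 2}, {1, 3}]

print(len(distinct_experts(scattered)))  # 8
print(len(distinct_experts(clustered)))  # 4
```

A load-only router sees these two workers as equal, yet the first must load twice as many experts' weights per step as the second.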
Here is ELDR’s move, and it is the whole idea. When a request runs prefill, it already lights up a set of experts — like a first sip that reveals the drink’s flavors. ELDR turns that into an expert signature: a guess of which experts the request will keep using through decode. Offline, a balanced K-means step carves the signature space into locality zones — one per decode worker — so each worker specializes in a cluster of experts while the zones stay evenly sized. Then at serving time, locality-band routing sends each request to the least-loaded worker whose zone matches its signature — the least-busy barista who already keeps those syrups within reach. A small signature cache rides alongside the KV cache, block for block, so a continued request never has to re-guess.
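The routing step described above can be sketched roughly as follows. This is not the paper's implementation: the `Worker` class, the `overlap` score, the `band` tolerance, and the representation of a zone as a plain set of expert IDs are all assumptions made for illustration, and the offline balanced K-means step is taken as already done. A signature is assumed to be a non-empty set of predicted expert IDs.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    zone: frozenset  # experts this worker specializes in (assumed offline output)
    load: int = 0    # requests currently decoding on this worker


def overlap(signature, zone):
    """Fraction of the request's predicted experts that fall inside the zone."""
    return len(signature & zone) / len(signature)


def route(signature, workers, band=0.1):
    """Pick the least-loaded worker among those within `band` of the best locality."""
    scores = {w.name: overlap(signature, w.zone) for w in workers}
    best = max(scores.values())
    candidates = [w for w in workers if scores[w.name] >= best - band]
    chosen = min(candidates, key=lambda w: w.load)
    chosen.load += 1
    return chosen


workers = [
    Worker("A", frozenset({0, 1, 2, 3}), load=5),
    Worker("B", frozenset({4, 5, 6, 7}), load=1),
    Worker("C", frozenset({2, 3, 4, 5}), load=3),
]

# Only A's zone covers this signature, so A wins even though B is least loaded.
print(route(frozenset({0, 1, 2}), workers).name)  # A
# A and C match equally well, so load breaks the tie in favor of C.
print(route(frozenset({2, 3}), workers).name)     # C
```

The `band` parameter is the hypothetical knob here: widening it lets load matter more, narrowing it lets locality matter more. A pure least-loaded router would have sent both requests to B.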
Walk the batches through it (illustrative numbers). Say a decode step’s time is part compute and part weight-loading, and weight-loading is about a third of it. Under load-balancing-only routing, a worker’s decode batch of 12 requests scatters across the model and touches 40 distinct experts, so it loads the weights of all 40. Under expert-locality routing, a batch of the same size is concentrated and touches only about 12. The counts are illustrative, but the shape is the point: fewer distinct experts means that weight-loading third shrinks, so each decode step gets shorter. On real MoE models the measured drop in median time-per-output-token (TPOT) is 5.9–13.9%.
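The arithmetic behind that walk-through can be written out as a toy model. The sketch below assumes weight-loading time scales linearly with the number of distinct experts and that compute time is unchanged; both are simplifications introduced here, not claims from the paper, and the function name is hypothetical.

```python
def relative_step_time(load_fraction, experts_before, experts_after):
    """Decode step time after routing, relative to 1.0 before (linear toy model)."""
    compute = 1.0 - load_fraction                               # unchanged by routing
    loading = load_fraction * (experts_after / experts_before)  # scales with experts
    return compute + loading


t = relative_step_time(1 / 3, 40, 12)
print(f"{t:.3f}")      # 0.767
print(f"{1 - t:.1%}")  # 23.3%
```

With the illustrative counts, the toy model gives a step about 23% shorter, which is larger than the measured 5.9–13.9%. That gap is a reminder that the 40 and 12 are made-up counts chosen to show the direction of the effect, not figures to read the paper's results from.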
| Routing strategy | What it balances | What each worker’s batch touches | Decode latency (TPOT) |
|---|---|---|---|
| Least-loaded / load-balancing only | compute load across workers | many, scattered experts → more weight loads per step | baseline |
| ELDR (arXiv 2607.00466) | compute load and expert locality | few, consistent experts → fewer weight loads per step | 5.9–13.9% lower median TPOT (measured, deployments up to 40 GPUs) |
Because the routing keeps each worker’s experts concentrated, the decode step loads fewer weights and the token-to-token gap shrinks, without changing the model or the hardware. The headline is not a new architecture. It is that the router had a lever no one was pulling: two workers can carry the same load yet show very different latency, and expert locality is what tells them apart.
Related explainers
- SGLang v0.5.14 — LPLB expert-parallel load balancing — a load-balancing approach of the kind ELDR argues is not enough on its own
- AMD ATOM — prefill/decode disaggregation — the disaggregated serving setting ELDR routes inside
- MobileMoE — DRAM-aware MoE scaling — why loading the right experts, not just fewer of them, is the memory cost
- Grouped Query Experts — MoE routing inside attention — another take on where and how experts get selected