The news. On July 2, 2026, Microsoft published ELDR (Expert-Locality-aware Decode Routing) (arXiv 2607.00466), a router for prefill/decode-disaggregated MoE serving. Its claim: existing routers balance only compute load across decode workers and ignore which experts each request will touch, so identically-loaded workers can have very different latency. By predicting a request’s experts and routing for locality, ELDR cuts median time-per-output-token (TPOT) by 5.9–13.9% across three MoE models and two workloads, on deployments up to 40 GPUs.

Picture a busy coffee bar. Each drink needs a specific handful of syrups, and each barista can only keep a few syrups within reach — anything else means a walk to the back stockroom, and that walk is the whole delay. A naive manager just sends each order to whoever is least busy. It looks balanced, but the least-busy barista often lacks the right syrups, so they keep trudging to the back. Fair on paper, slow in the cup.

That is exactly what a mixture-of-experts model faces during decode. Each decode step must load the weights of every distinct expert the batch activates — the syrups its drinks need — and the more scattered those experts are, the more weight-loading each step pays. A router that only evens out compute load ignores the one thing that actually sets the delay: how many distinct experts a worker’s batch drags in.
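
To make the counting concrete, here is a minimal Python sketch. The expert ids, the two example batches, and the helper name `distinct_experts` are made up for illustration and do not come from the paper. It only shows that two workers holding the same number of requests can activate very different numbers of distinct experts.

```python
def distinct_experts(batch):
    """Return the set of distinct experts activated by a batch.

    `batch` is a list of sets, one set of expert ids per request.
    """
    touched = set()
    for experts in batch:
        touched |= experts
    return touched


# Hypothetical data: both workers hold three requests (equal load).
scattered_batch = [{0, 1}, {7, 12}, {20, 31}]  # requests share no experts
clustered_batch = [{0, 1}, {1, 2}, {0, 2}]     # requests overlap heavily

scattered_count = len(distinct_experts(scattered_batch))
clustered_count = len(distinct_experts(clustered_batch))
print(scattered_count, clustered_count)  # 6 3
```

Same request count on both workers, but the first one has twice as many experts to load per step.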

Here is ELDR’s move, and it is the whole idea. When a request runs prefill, it already lights up a set of experts — like a first sip that reveals the drink’s flavors. ELDR turns that into an expert signature: a guess of which experts the request will keep using through decode. Offline, a balanced K-means step carves the signature space into locality zones — one per decode worker — so each worker specializes in a cluster of experts while the zones stay evenly sized. Then at serving time, locality-band routing sends each request to the least-loaded worker whose zone matches its signature — the least-busy barista who already keeps those syrups within reach. A small signature cache rides alongside the KV cache, block for block, so a continued request never has to re-guess.
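
The serving-time half of that idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper's implementation: the function names, the Euclidean distance, the `band_size` parameter, and the reading of a "band" as the few zones closest to a signature are all hypothetical. The zone centroids are assumed to come from the offline clustering step, which is not shown.

```python
import math


def expert_signature(prefill_counts):
    """Normalize per-expert activation counts from prefill into a signature.

    Assumes at least one expert was activated (non-zero total).
    """
    total = sum(prefill_counts)
    return [count / total for count in prefill_counts]


def route(signature, centroids, loads, band_size=2):
    """Pick the least-loaded worker among the `band_size` closest zones.

    Assumption: one zone per worker, so zone index == worker index.
    """
    by_distance = sorted(
        range(len(centroids)),
        key=lambda zone: math.dist(signature, centroids[zone]),
    )
    band = by_distance[:band_size]
    return min(band, key=lambda worker: loads[worker])


# Hypothetical setup: 4 experts, 3 decode workers.
centroids = [
    [0.5, 0.5, 0.0, 0.0],  # worker 0 specializes in experts 0 and 1
    [0.0, 0.5, 0.5, 0.0],  # worker 1 specializes in experts 1 and 2
    [0.0, 0.0, 0.5, 0.5],  # worker 2 specializes in experts 2 and 3
]
loads = [9, 4, 1]  # current requests per worker

signature = expert_signature([6, 3, 1, 0])
chosen = route(signature, centroids, loads)
print(chosen)  # 1
```

A pure least-loaded router would pick worker 2, whose zone is the farthest from this request's signature. With a band of two zones, the sketch picks worker 1 instead: lighter than the closest match (worker 0) and still a near match on experts.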

Walk the batches through it (illustrative numbers). Say a decode step’s time is part compute and part weight-loading, and weight-loading is about a third of it. Under load-balancing-only routing, a worker’s decode batch of 12 requests scatters across the model and touches 40 distinct experts, so it reloads all 40. Under expert-locality routing, the same batch is concentrated and touches only about 12 — the counts here are illustrative, but the shape is the point: fewer distinct experts means that weight-loading third shrinks, so each decode step gets shorter. On real MoE models the measured drop is 5.9–13.9% median TPOT.
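
The same walk-through as arithmetic, in a short Python sketch. It rests on a toy assumption that is not from the paper: weight-loading time scales linearly with the number of distinct experts, and compute time is unaffected by routing. The function name `step_time` and its parameters are hypothetical.

```python
def step_time(distinct_experts, baseline_experts=40,
              compute_share=2 / 3, load_share=1 / 3):
    """Relative decode-step time, with the 40-expert step normalized to 1.0.

    Assumption: weight-loading time scales linearly with the number of
    distinct experts; compute time is unchanged by routing.
    """
    return compute_share + load_share * distinct_experts / baseline_experts


baseline = step_time(40)   # 1.0 by construction
localized = step_time(12)  # 2/3 + (1/3) * (12/40)
reduction = 1 - localized / baseline
print(f"{reduction:.1%}")  # 23.3%
```

The toy model gives a drop of about 23%, larger than the measured 5.9–13.9%. That gap is expected: the 40 and 12 are illustrative counts and the linear scaling is a simplification, so the sketch shows the direction of the effect rather than its size.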

Figure: prefill vs. decode.

Prefill (prompt tokens "The cat sat on"): all prompt tokens are processed at once, in parallel. The KV cache fills up in one shot. The GPU does lots of math (compute-bound). Fast, because the GPU is good at parallel work.

Decode (output tokens "the mat ."): output tokens are generated one at a time. Each step reads the entire KV cache. The GPU mostly loads data (memory-bound). Slower, because it is waiting for data, not computing.

Prefill = one big batch (fast) → Decode = one token at a time (slower)

| Routing strategy | What it balances | What each worker’s batch touches | Decode latency (TPOT) |
| --- | --- | --- | --- |
| Least-loaded / load-balancing only | compute load across workers | many, scattered experts → more weight loads per step | baseline |
| ELDR (arXiv 2607.00466) | compute load and expert locality | few, consistent experts → fewer weight loads per step | ~5.9–13.9% lower (measured, up to 40 GPUs) |

Because the routing keeps each worker’s experts concentrated, the decode step reloads fewer weights and the token-to-token gap shrinks, without changing the model or the hardware. The headline is not a new architecture; it is that the router had a lever no one was pulling: two workers can carry the same load yet incur very different latency, and expert locality is what tells them apart.
