What is expert-locality-aware decode routing (ELDR)?

ELDR (arXiv 2607.00466, Microsoft) is a router for prefill/decode-disaggregated mixture-of-experts (MoE) serving. From a request's prefill expert activations it builds an 'expert signature' predicting which experts the request will use during decode. Offline, a balanced K-means step partitions the signature space into expert-locality zones, one per decode worker, and at serving time each request goes to the least-loaded worker whose zone matches its signature. This keeps each worker's batch touching a small, consistent set of experts, so the decode step reloads fewer expert weights and median time-per-output-token (TPOT) drops 5.9–13.9%.
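To make the balanced K-means step more concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the greedy capacity-capped assignment, the function names (`balanced_assign`, `balanced_kmeans`), and the toy two-expert signatures are all assumptions chosen for illustration. What it shows is the property the step needs, namely that zones stay evenly sized even when the signatures themselves are lopsided.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two signatures."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def balanced_assign(points, centroids):
    """Assign each point to a nearby centroid, capping every zone's size.

    Greedy rule (an assumption, one simple way to balance): visit all
    (point, centroid) pairs from closest to farthest and accept a pair
    only if the point is still unassigned and the zone has room.
    """
    cap = math.ceil(len(points) / len(centroids))
    pairs = sorted(
        (sq_dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centroids)
    )
    assign = [None] * len(points)
    sizes = [0] * len(centroids)
    for _, i, j in pairs:
        if assign[i] is None and sizes[j] < cap:
            assign[i] = j
            sizes[j] += 1
    return assign

def balanced_kmeans(points, centroids, iters=10):
    """Alternate capped assignment and centroid updates."""
    for _ in range(iters):
        assign = balanced_assign(points, centroids)
        updated = []
        for j, c in enumerate(centroids):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(list(c))  # keep an empty zone's centroid
        centroids = updated
    return balanced_assign(points, centroids), centroids

# Toy signatures over 2 experts: four lean toward expert 0, two toward expert 1.
signatures = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.0, 1.0]]
zones, centroids = balanced_kmeans(signatures, [[1.0, 0.0], [0.0, 1.0]])
zone_sizes = [zones.count(0), zones.count(1)]
```

Plain K-means would split these six signatures 4/2; the capacity cap pushes the borderline signature into the second zone so each zone holds three.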

Why does routing by expert locality lower latency?

A MoE decode step must load the weights of every distinct expert the batch activates, so latency scales with how scattered those experts are. Routers that balance only compute load can leave two workers equally busy but touching very different numbers of experts, and the one dragging in more experts is slower. Concentrating each worker's requests onto a matching set of experts cuts the distinct-expert count per step, which shrinks the weight-loading portion of the step and lowers TPOT with no change to the model or hardware.

How does ELDR relate to load balancing and prefill/decode disaggregation?

Expert-parallel load balancers, such as SGLang's LPLB, even out compute load across GPUs but do not control which experts each worker touches. ELDR keeps that load-balancing goal and adds expert locality, so it optimizes load and locality together rather than load alone. It operates in a prefill/decode-disaggregated system, where prefill and decode run on separate GPU pools: ELDR reads the prefill phase's expert activations to predict decode's experts, then routes among the decode workers. Its signature cache is co-indexed with the KV cache at block granularity.

ELDR routes MoE decode by expert locality, cutting TPOT up to 13.9% — Expert-locality-aware decode routing

Jargon

MoE (mixture of experts): A model where each token is routed through only a few small sub-networks (“experts”) out of many, so it holds a lot of parameters but only activates a slice per token. The catch: a serving worker must have the weights of every expert its batch touches.
Prefill vs decode: Two phases of generation. Prefill reads the whole prompt at once; decode then emits one token at a time. ELDR reads the prefill’s expert activations to guess what decode will need.
Prefill/decode disaggregation: Running prefill and decode on separate GPU pools so a long prompt’s prefill can’t stall other requests’ decode. ELDR routes among the decode workers in that pool.
TPOT (time per output token): The average gap between two generated tokens — the latency a reader feels as text streams in. ELDR’s headline result is a 5.9–13.9% median TPOT reduction.
Expert signature: ELDR’s prediction of which experts a request will activate during decode, built from the experts it already lit up during prefill. It is the key each request carries to the router.
Balanced K-means: An offline clustering step that partitions the space of expert signatures into locality zones — one per decode worker — while keeping the zones evenly sized so no worker is overloaded.
Locality-band routing: The serving-time rule: send each request to the least-loaded worker among those whose zone matches its signature. It balances load and expert locality at once, instead of load alone.
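As a rough illustration of the expert-signature idea from the list above, the sketch below turns prefill activations into a normalized usage histogram and reads off the most-used experts as the decode prediction. The histogram form, the input format (one list of expert ids per prompt token), and the names `expert_signature` and `predicted_experts` are assumptions for illustration; the text here does not spell out how the paper actually encodes a signature.

```python
from collections import Counter

def expert_signature(prefill_activations, num_experts):
    """Normalized histogram of expert usage observed during prefill.

    prefill_activations: one list of expert ids per prompt token
    (hypothetical format; top-k routing yields k ids per token).
    """
    counts = Counter(e for token in prefill_activations for e in token)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]

def predicted_experts(signature, top_n):
    """Ids of the top_n most-used experts, ties broken by lower id."""
    ranked = sorted(range(len(signature)), key=lambda e: (-signature[e], e))
    return ranked[:top_n]

# Three prompt tokens, top-2 routing, 8 experts in the layer.
activations = [[1, 3], [1, 5], [3, 1]]
sig = expert_signature(activations, num_experts=8)
hot = predicted_experts(sig, top_n=2)  # experts 1 and 3 dominate prefill
```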

The news. On July 2, 2026, Microsoft published ELDR (Expert-Locality-aware Decode Routing) (arXiv 2607.00466), a router for prefill/decode-disaggregated MoE serving. Its claim: existing routers balance only compute load across decode workers and ignore which experts each request will touch, so identically-loaded workers can have very different latency. By predicting a request’s experts and routing for locality, ELDR cuts median time-per-output-token by 5.9–13.9% across three MoE models and two workloads, on deployments up to 40 GPUs.

Picture a busy coffee bar. Each drink needs a specific handful of syrups, and each barista can only keep a few syrups within reach — anything else means a walk to the back stockroom, and that walk is the whole delay. A naive manager just sends each order to whoever is least busy. It looks balanced, but the least-busy barista often lacks the right syrups, so they keep trudging to the back. Fair on paper, slow in the cup.

That is exactly what a mixture-of-experts model faces during decode. Each decode step must load the weights of every distinct expert the batch activates — the syrups its drinks need — and the more scattered those experts are, the more weight-loading each step pays. A router that only evens out compute load ignores the one thing that actually sets the delay: how many distinct experts a worker’s batch drags in.
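The quantity that matters here, the number of distinct experts a batch drags in, is easy to compute. The toy sketch below compares two workers carrying the same number of requests; the batches and the helper name `distinct_experts` are made up for illustration.

```python
def distinct_experts(batch):
    """Count the distinct experts a decode batch activates in one step.

    batch: one set of expert ids per request in the batch.
    """
    return len(set().union(*batch))

# Two workers, four requests each, so a load-only router sees them as equal.
scattered = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
concentrated = [{0, 1}, {1, 2}, {0, 2}, {0, 1}]

same_load = len(scattered) == len(concentrated)
scattered_count = distinct_experts(scattered)        # 8 experts to load
concentrated_count = distinct_experts(concentrated)  # 3 experts to load
```

Both workers look identical to a router that counts requests, yet the first must load weights for 8 experts per step and the second for only 3.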

Here is ELDR’s move, and it is the whole idea. When a request runs prefill, it already lights up a set of experts — like a first sip that reveals the drink’s flavors. ELDR turns that into an expert signature: a guess of which experts the request will keep using through decode. Offline, a balanced K-means step carves the signature space into locality zones — one per decode worker — so each worker specializes in a cluster of experts while the zones stay evenly sized. Then at serving time, locality-band routing sends each request to the least-loaded worker whose zone matches its signature — the least-busy barista who already keeps those syrups within reach. A small signature cache rides alongside the KV cache, block for block, so a continued request never has to re-guess.
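The serving-time rule described above can be sketched in a few lines. This is a sketch under stated assumptions, not ELDR's implementation: it assumes a zone "matches" when its centroid's cosine similarity to the signature is within a tolerance of the best match, and the `band` parameter, the `route` function, and the toy centroids are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route(signature, zone_centroids, worker_load, band=0.1):
    """Pick the least-loaded worker among those whose zone matches.

    zone_centroids[w] is the centroid of worker w's locality zone.
    band is a hypothetical tolerance: zones within `band` of the best
    similarity count as matching. Ties on load go to the closer zone.
    """
    sims = [cosine(signature, c) for c in zone_centroids]
    best = max(sims)
    candidates = [w for w, s in enumerate(sims) if s >= best - band]
    return min(candidates, key=lambda w: (worker_load[w], -sims[w], w))

centroids = [
    [0.5, 0.5, 0.0, 0.0],  # worker 0: experts 0 and 1
    [0.4, 0.6, 0.0, 0.0],  # worker 1: nearly the same experts
    [0.0, 0.0, 0.5, 0.5],  # worker 2: experts 2 and 3
]
load = [5, 2, 0]  # requests currently on each worker
sig = [0.5, 0.5, 0.0, 0.0]

chosen = route(sig, centroids, load)            # worker 1
strict = route(sig, centroids, load, band=0.0)  # worker 0
```

Worker 2 is idle but holds the wrong experts, so it is never a candidate; within the matching band, the lighter-loaded worker 1 wins over the exact-match worker 0. Shrinking `band` to zero trades load balance for tighter locality.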

Walk the batches through it (illustrative numbers). Say a decode step’s time is part compute and part weight-loading, and weight-loading is about a third of it. Under load-balancing-only routing, a worker’s decode batch of 12 requests scatters across the model and touches 40 distinct experts, so it reloads all 40. Under expert-locality routing, the same batch is concentrated and touches only about 12 — the counts here are illustrative, but the shape is the point: fewer distinct experts means that weight-loading third shrinks, so each decode step gets shorter. On real MoE models the measured drop is 5.9–13.9% median TPOT.
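The arithmetic behind that walk-through can be written out directly. The sketch assumes weight-loading time scales linearly with the number of distinct experts and that compute time is unchanged; both are simplifications, and `step_time_ratio` is a made-up helper name.

```python
def step_time_ratio(load_fraction, experts_before, experts_after):
    """New step time relative to the old one (1.0 means unchanged).

    Assumes weight-loading time is proportional to the number of
    distinct experts and that the compute portion stays the same.
    """
    compute = 1.0 - load_fraction
    load = load_fraction * experts_after / experts_before
    return compute + load

# Weight-loading is a third of the step; distinct experts drop from 40 to 12.
ratio = step_time_ratio(1 / 3, 40, 12)
reduction_pct = round((1 - ratio) * 100, 1)  # about 23.3
```

With these toy inputs the step shrinks by roughly 23%, which is larger than the 5.9–13.9% measured on real models. The toy numbers show the direction of the effect, not a prediction of its size.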

Routing strategy	What it balances	What each worker’s batch touches	Decode latency (TPOT)
Least-loaded / load-balancing only	compute load across workers	many, scattered experts → more weight loads per step	baseline
ELDR (arXiv 2607.00466)	compute load and expert locality	few, consistent experts → fewer weight loads per step	~5.9–13.9% lower (measured, up to 40 GPUs)

Because the routing keeps each worker’s experts concentrated, the decode step reloads fewer weights and the token-to-token gap shrinks — without changing the model or the hardware. The headline is not a new architecture; it is that the router had a lever no one was pulling: two workers can carry the same load yet cost very different latency, and expert locality is what tells them apart.


Related explainers

SGLang v0.5.14 — LPLB expert-parallel load balancing — a load-balancing approach of the kind ELDR argues is not enough on its own
AMD ATOM — prefill/decode disaggregation — the disaggregated serving setting ELDR routes inside
MobileMoE — DRAM-aware MoE scaling — why loading the right experts, not just fewer of them, is the memory cost
Grouped Query Experts — MoE routing inside attention — another take on where and how experts get selected


