The news. On June 26, 2026, the SGLang team released v0.5.14, with work from 56 contributors. The headline is 5x higher throughput at the same interactivity when serving DeepSeek-V4 on NVIDIA GB300, driven by two new expert-parallel load balancers, Waterfill and LPLB (a linear-programming load balancer), plus CuteDSL prefill kernels for Blackwell and int8 checkpoint pooling for linear-attention prefix caches.

Picture a warehouse store at peak rush. The checkout lanes are the GPUs; the specialty counters — deli, pharmacy, bakery — are the model's experts, and because no single lane can hold them all, the store spreads the counters across the lanes. That spread is expert parallelism: a mixture-of-experts model has too many experts to fit on one GPU, so they live across many, and each decode step the router sends every customer (token) to the one or two counters they need. The trouble is that the rush is lumpy. This wave, everyone wants the deli; next wave, the pharmacy. So one counter gets mobbed while the rest stand idle — and the store can't close out the rush until that longest line clears.

That last clause is the whole problem, because the lanes do not finish independently. Every GPU has to meet at a sync barrier — the all-to-all that ships tokens to their experts and the answers back — and that barrier waits for the slowest lane. The GPU holding this step's most popular expert therefore sets the pace for all of them, and the fast lanes burn the difference as idle time. Add more GPUs and the imbalance can get worse, not better, because the hot expert still lives on one lane while you have paid for more lanes to stand around.
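The barrier effect can be put in numbers with a minimal sketch. It assumes each GPU's work in a step is proportional to the tokens it receives; the function name `barrier_idle` and the four-GPU loads are hypothetical, not taken from the release.

```python
def barrier_idle(loads):
    """Return (step_time, idle_fractions) for GPUs that share one barrier.

    Assumption: a GPU's compute time is proportional to its token load,
    so the step lasts as long as the most loaded GPU needs.
    """
    step_time = max(loads)
    # Each GPU idles for whatever part of the step it does not spend working.
    idle = [(step_time - load) / step_time for load in loads]
    return step_time, idle


# Hypothetical 4-GPU step: one GPU holds the hot expert.
loads = [30.0, 10.0, 10.0, 10.0]
step_time, idle = barrier_idle(loads)
mean_idle = sum(idle) / len(idle)

print(f"step time: {step_time}, mean idle fraction: {mean_idle:.2f}")
```

Here three of the four GPUs wait for two-thirds of the step, so half of the machine's time in that step is idle on average.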

SGLang v0.5.14's fix is to stop letting one counter bottleneck the floor. It keeps redundant replicas of the hot experts — duplicate deli counters on several lanes — and then, each wave, the floor manager solves a quick assignment problem: given how many customers want each counter right now, divide every counter's line across its copies so the busiest lane does as little as possible. That floor manager is LPLB, and "as little as possible" is literal: it solves a small linear program whose objective is to minimize the maximum per-GPU load (a min-max). Waterfill is the other balancer the release pairs it with, and SGLang does not spell out how it works. The name, though, points to a classic water-filling heuristic — fill the least-loaded replica first — which would be a lighter alternative to running the LP every step.
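One way to write down a min-max assignment of this kind is the generic textbook linear program below. It is a sketch under stated assumptions, not the formulation SGLang publishes: the symbols are introduced here for illustration, with $t_e$ the tokens routed to expert $e$ this step, $R(e)$ the set of GPUs holding a replica of $e$, $x_{e,g}$ the tokens of expert $e$ sent to the replica on GPU $g$, and $z$ the load of the busiest GPU.

$$
\begin{aligned}
\min_{x,\,z}\quad & z \\
\text{s.t.}\quad & \sum_{g \in R(e)} x_{e,g} = t_e && \text{for every expert } e, \\
& \sum_{e \,:\, g \in R(e)} x_{e,g} \le z && \text{for every GPU } g, \\
& x_{e,g} \ge 0 && \text{for every } e \text{ and } g \in R(e).
\end{aligned}
$$

The first constraint says every token of an expert goes to one of its copies, the second caps each GPU's total at $z$, and minimizing $z$ pushes the busiest GPU's load as low as the replica layout allows.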

Hold the layout fixed and walk the imbalance math (illustrative; the release reports only the end-to-end 5x). Say 8 GPUs serve a batch, and the router sends 40% of this step's tokens to one hot expert that lives on a single GPU, while another GPU draws just 5%. The step can't end until that one GPU finishes its 40%. If work scales with token count, the other seven share the remaining 60%, under 9% each on average, so they sit idle for roughly four-fifths of the step: you own 8 GPUs but move at the speed of the busiest one. Now place 3 replicas of that hot expert and let LPLB split its tokens across them: its share per GPU falls from 40% toward about 13%, the barrier wait shrinks sharply, and the lanes finish much closer together. The win isn't a faster kernel. It's deleting the idle time that imbalance was manufacturing.
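The replica split can be sketched with the classic water-filling heuristic: pour the hot expert's tokens into the least-loaded replica first until the replica GPUs level out. This is the textbook technique, not SGLang's Waterfill or LPLB code, whose internals the release does not detail. The per-GPU loads beyond the 40% and 5% figures, the replica placement on GPUs 0, 1 and 2, and the function name `waterfill` are all assumptions made for illustration.

```python
def waterfill(base, amount):
    """Split `amount` across slots that already carry `base` load so that
    the highest resulting load is as low as possible (classic water-filling)."""
    order = sorted(range(len(base)), key=lambda i: base[i])
    level = base[order[0]]  # current water level
    remaining = amount
    for k in range(len(order)):
        active = k + 1  # slots currently under water
        nxt = base[order[k + 1]] if k + 1 < len(order) else float("inf")
        cost = (nxt - level) * active  # tokens needed to reach the next slot
        if cost >= remaining:
            level += remaining / active
            break
        remaining -= cost
        level = nxt
    return [max(0.0, level - b) for b in base]


# Assumed non-hot load per GPU, in percent of the step's tokens (sums to 60).
other = [0.0, 5.0, 9.0, 9.0, 9.0, 9.0, 9.0, 10.0]
hot = 40.0          # the hot expert's share
replicas = [0, 1, 2]  # hypothetical GPUs holding a copy of the hot expert

before = other.copy()
before[0] += hot    # single copy: all 40% lands on GPU 0

split = waterfill([other[g] for g in replicas], hot)
after = other.copy()
for g, share in zip(replicas, split):
    after[g] += share

print("busiest GPU before:", max(before))
print("hot-expert split:", split)
print("busiest GPU after:", max(after))
```

Under these assumed loads the split is uneven (18, 13 and 9) because the replica GPUs already carry different amounts of other work, and the busiest GPU drops from 40% to 18% of the step's tokens. An even three-way split of about 13% each would leave the already-loaded replica GPU busier than that, which is why a load-aware split helps.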

| Expert-parallel balancing | How it assigns load | Per-step cost | Balance quality |
| --- | --- | --- | --- |
| Static / hand-tuned placement | fixed expert→GPU map, set before serving | ~none | poor under shifting, data-dependent routing |
| Waterfill (this release) | the release's second balancer; name implies water-filling, internals not detailed | a lighter companion to LPLB (inferred from the name) | |
| LPLB (this release) | solves a linear program to minimize the busiest GPU's load | a small solve each step | tightest: a min-max optimum over replicas (SGLang v0.5.14) |

Where it earns its keep is exactly the regime DeepSeek-V4 lives in: a large MoE served with expert parallelism across many Blackwell GPUs, where the all-to-all and its sync barrier are a leading cost in each decode step. The release's headline — 5x higher throughput at the same interactivity — is a goodput claim: more tokens per second without making any single user wait longer. Read it as the lanes finishing together instead of seven of them waiting on one — the same hardware, far less idle time.
