Spec-decode latency paper — Load-dependent latency model

Spec-decode latency hero animation. A horizontal arrival-rate slider at the top moves left to right across five operating points; an operating-point dot rides a single decay curve in the plot below, falling from a 2.8× speedup at low load to about 1.05× as the GPU saturates. A right-side stacked bar visualizes time-per-verify-cycle, with the load-independent green slice shrinking as the load-dependent pink slice grows. A payoff overlay headlines: 2.8× to 1.05× as load rises — same draft-verify protocol, only the load changed.
learnaivisually.com/ai-explained/spec-decode-latency-load-model

The news. On May 14, 2026, an arXiv preprint introduced an interpretable latency model for speculative decoding in production LLM serving, one where batch sizes emerge dynamically from the arrival rate rather than being fixed in advance by a benchmark. Per-request demand is split into load-independent and load-dependent components across the prefill, drafting, and verification stages, and the effective batch size is inferred from the request rate via Little's Law. The authors validate the model in vLLM across model sizes and show that it explains why spec-decoding speedups diminish as server load increases.

Picture the slow-shift barista. With no one waiting, pre-making a guess at your next drink is free — they had nothing else to do with that minute. Even if half their guesses are wrong, you walk up, the right cup is already there, and they look like a wizard. Now imagine the lunch rush. Every minute the barista spends on a wrong guess is a minute the next customer waits. The "guess" hasn't changed; the cost of being wrong has changed, because the slack the guess was eating into has vanished. The paper formalises exactly this for spec-decode: the load-independent part of a request (its prefill, its drafter pass) is roughly constant, but the load-dependent verify step gets packed into a wider matmul as more requests join — and a wider matmul takes longer.

The mechanism splits into three stages, each priced differently. Prefill for a request is load-independent in the typical regime: its FLOPs are paid up front and it is mostly compute-bound. The drafter is small and runs once per speculation window, so its cost is also treated as load-independent relative to the target model. The verify step is where load shows up: the target model batches the drafted tokens of every in-flight request into one forward pass, so its matmul widens with the number of concurrent requests. At a low arrival rate λ the verify matmul is narrow and almost free, and most of the per-token clock time is the load-independent slice. At high λ the verify matmul is the continuous-batching workhorse, the same matmul plain decode would run anyway, with extra width tacked on per request for the speculative window. That extra width no longer hides inside slack; it is billed directly.
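The split described above can be sketched numerically. This is a minimal illustration, not the paper's model: it borrows the made-up 20 ms load-independent cost and 5 ms × N verify cost from the worked example later in this piece, and the function name `slice_shares` is hypothetical.

```python
def slice_shares(n_inflight, independent_ms=20.0, verify_ms_per_request=5.0):
    """Return (independent_share, dependent_share) of one verify cycle.

    Both cost constants are illustrative assumptions, not measured values.
    """
    dependent_ms = verify_ms_per_request * n_inflight  # grows with concurrency
    total_ms = independent_ms + dependent_ms
    return independent_ms / total_ms, dependent_ms / total_ms


for n in (1, 4, 16, 32):
    indep, dep = slice_shares(n)
    print(f"N={n:>2}: load-independent {indep:.0%}, load-dependent {dep:.0%}")
```

Under these assumed constants the load-independent slice dominates when one request is in flight and shrinks to a small fraction at N = 32, which is the shape the stacked bar in the hero animation is described as showing.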

Where does N come from? The paper's second move is to stop treating the batch size N as a hyperparameter. Real servers don't pick a batch: requests arrive at rate λ, spend some total time W in the engine, and the average count in flight is L = λ × W by Little's Law. That in-flight count is the effective batch size N, so plugging in the per-stage costs lets the model predict N from λ rather than asking the user to specify it. That's what turns the analysis into a curve instead of a number: the same drafter whose low-load speedup is, say, ~2.8× (illustrative) can decay toward ~1.0× at saturation, and the inflection point is computable from the model, not just discovered on a dashboard.
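One way to see how N falls out of λ is to close the loop: time in system depends on N, and N depends on time in system. The sketch below assumes, purely for illustration, that each request needs a fixed number of accepted tokens and that each token costs a + b·N seconds when N requests are in flight. The linear cost form, the constants, and the names `predicted_batch_size` and `tokens_per_request` are assumptions, not the paper's closed form. Substituting W(N) = T·(a + b·N) into N = λ·W(N) gives N = λ·T·a / (1 − λ·T·b).

```python
def predicted_batch_size(arrival_rate, tokens_per_request=10,
                         independent_s=0.020, verify_s_per_request=0.005):
    """Solve N = arrival_rate * W(N) with W(N) = T * (a + b * N).

    Returns the steady-state in-flight count N, or None if the assumed
    system is saturated (no finite fixed point exists).
    All default constants are hypothetical.
    """
    # Load-independent seconds per request.
    fixed_s = tokens_per_request * independent_s
    # Extra seconds per request for each concurrent request.
    per_peer_s = tokens_per_request * verify_s_per_request
    denom = 1.0 - arrival_rate * per_peer_s
    if denom <= 0:
        return None  # arrival rate at or beyond capacity
    return arrival_rate * fixed_s / denom


for lam in (1, 5, 10, 15, 19):
    n = predicted_batch_size(lam)
    print(f"lambda={lam:>2} req/s -> N ~ {n:.2f}")
```

With these made-up constants the fixed point grows without bound as λ approaches 20 req/s, which mirrors the saturation behaviour the paragraph describes.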

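[Interactive widget: Little's Law, L = λ × W. Example setting: λ = 50 req/s, W = 0.10 s, giving a computed L = 5 requests in flight. The accompanying chart plots W against λ.]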

Little's Law: L = λ × W holds for any stable queue in steady state. In the widget, the no-load service time is fixed at 50 ms. As λ approaches capacity (~100 req/s), W diverges; that is the knee in the accompanying saturation chart.
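The widget's numbers are consistent with a simple delay curve, W = S / (1 − λ / capacity), with S = 50 ms and capacity = 100 req/s. That formula is an assumption here (an M/M/1-style curve chosen because it reproduces W = 0.10 s at λ = 50 req/s); the page does not state which curve it uses. The function names are hypothetical.

```python
def time_in_system(arrival_rate, service_s=0.050, capacity=100.0):
    """Assumed delay curve: W = S / (1 - utilisation). Diverges at capacity."""
    utilisation = arrival_rate / capacity
    if utilisation >= 1.0:
        raise ValueError("arrival rate must stay below capacity for a stable queue")
    return service_s / (1.0 - utilisation)


def in_flight(arrival_rate, **kwargs):
    """Little's Law: L = lambda * W."""
    return arrival_rate * time_in_system(arrival_rate, **kwargs)


for lam in (10, 50, 90, 99):
    print(f"lambda={lam:>2} req/s  W={time_in_system(lam):.3f} s  L={in_flight(lam):.1f}")
```

The loop shows W, and therefore L, climbing slowly at first and then steeply as λ nears the assumed capacity.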

For practitioners, the consequence is the break-even calculation. A worked example sharpens it (numbers illustrative; the paper's actual closed form depends on drafter size, window length, kernel choice, and hardware). Suppose the load-independent slice — prefill amortized plus the drafter pass — costs 20 ms per accepted token, and the load-dependent verify slice costs 5 ms × N for a window-of-4 spec configuration. At N = 1 (idle), spec-decode pays 20 + 5 × 1 = 25 ms per accepted token, while plain decode pays its full forward pass per token at roughly 20 + 20 = 40 ms — a ~1.6× win. At N = 32 (saturated), spec pays 20 + 5 × 32 = 180 ms per token, and plain decode's same wide matmul lands at roughly the same total cost — ~1.0×, no win. The numbers are made up; the shape of the decay is the teaching point and falls out of the same algebra regardless.
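The arithmetic above can be replayed in a few lines. The spec-decode cost uses the paragraph's 20 + 5 × N formula; the plain-decode figures (40 ms at N = 1, roughly 180 ms at N = 32) are copied from the text rather than derived, since no formula is given for them. All numbers remain illustrative, and the names are hypothetical.

```python
INDEPENDENT_MS = 20.0        # load-independent slice per accepted token (illustrative)
VERIFY_MS_PER_REQUEST = 5.0  # load-dependent verify cost per in-flight request (illustrative)

# Plain-decode cost per token at the two operating points quoted in the text.
PLAIN_DECODE_MS = {1: 40.0, 32: 180.0}


def spec_decode_ms(n_inflight):
    """Per-accepted-token cost of spec-decode at batch size N."""
    return INDEPENDENT_MS + VERIFY_MS_PER_REQUEST * n_inflight


def speedup(n_inflight):
    """Plain-decode cost divided by spec-decode cost at the same N."""
    return PLAIN_DECODE_MS[n_inflight] / spec_decode_ms(n_inflight)


for n in sorted(PLAIN_DECODE_MS):
    print(f"N={n:>2}: spec {spec_decode_ms(n):.0f} ms, "
          f"plain {PLAIN_DECODE_MS[n]:.0f} ms, speedup {speedup(n):.2f}x")
```

Swapping in measured per-stage costs for a real deployment would turn the same two functions into a rough break-even check, though the paper's actual closed form has more terms than this toy version.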

How it sits next to existing spec-decode analyses

| Analysis style | What it reports | Captures load decay? | Practitioner use |
| --- | --- | --- | --- |
| Single-headline benchmark | "X× at batch B" | no | marketing one-liner |
| Acceptance-length focus (Medusa, EAGLE, PPOW) | tokens accepted / window | no (orthogonal axis) | drafter architecture choice |
| vLLM dashboards / TPOT plots | measured curve in deployment | yes, empirically | post-hoc capacity check |
| Load-dependent latency model (this paper) | closed-form speedup vs λ | yes, predicted from first principles | pre-deployment break-even |

The headline isn't a new draft model architecture or a kernel — it's a vocabulary. Once a team can talk about "the load-independent slice" and "the load-dependent slice" separately, the tradeoff decision for spec-decode in any given service stops being binary ("turn it on / off") and becomes a function of expected traffic shape. A workload that spends most of its time idle benefits maximally; a workload that runs hot and steady benefits hardly at all; and the inflection point — where the verify slice eats the slack — is calculable rather than discovered.
