Spec-decode latency paper — Load-dependent latency model
The news. On May 14, 2026, an arXiv preprint introduced an interpretable latency model for speculative decoding in production LLM serving, one where batch sizes emerge dynamically from arrival rate rather than being predetermined by a benchmark. Per-request demand is split into load-independent and load-dependent components across prefill, drafting, and verification stages, and the effective batch size is inferred from request rate via Little's Law. The authors validate the model in vLLM across model sizes and show that it explains why spec-decoding speedups diminish as server load increases.
Picture the slow-shift barista. With no one waiting, pre-making a guess at your next drink is free — they had nothing else to do with that minute. Even if half their guesses are wrong, you walk up, the right cup is already there, and they look like a wizard. Now imagine the lunch rush. Every minute the barista spends on a wrong guess is a minute the next customer waits. The "guess" hasn't changed; the cost of being wrong has changed, because the slack the guess was eating into has vanished. The paper formalises exactly this for spec-decode: the load-independent part of a request (its prefill, its drafter pass) is roughly constant, but the load-dependent verify step gets packed into a wider matmul as more requests join — and a wider matmul takes longer.
The mechanism splits into three stages, each priced differently. Prefill for a request is load-independent in the typical regime: its FLOPs are paid up front, and it is mostly compute-bound. The drafter is small and runs once per window, so its cost is also load-independent relative to the target. The verify step is where load shows up: the target model batches accepted-or-rejected probes across every in-flight request, so its matmul scales with the number of concurrent requests. At a low request arrival rate λ, the verify matmul is narrow and almost free, and most of the per-token clock time is the load-independent slice. At high λ, the verify matmul is the continuous-batching workhorse: the same matmul plain decode would run anyway, with extra width tacked on per request for the speculative window. The "extra width" no longer hides inside slack; it bills directly.
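The three-stage split can be sketched as a toy cost function. This is a minimal illustration, not the paper's closed form: the `StageCosts` fields, the function names, and the assumption that verify cost grows linearly with the number of in-flight requests are all hypothetical, and the numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageCosts:
    """Hypothetical per-accepted-token costs in milliseconds (not from the paper)."""
    prefill_ms: float         # amortized prefill, treated as load-independent
    draft_ms: float           # drafter pass, treated as load-independent
    verify_ms_per_req: float  # verify cost added per in-flight request (assumed linear)


def load_independent_ms(costs: StageCosts) -> float:
    """Slice that does not change with the number of concurrent requests."""
    return costs.prefill_ms + costs.draft_ms


def load_dependent_ms(costs: StageCosts, n_in_flight: float) -> float:
    """Verify slice, assumed to scale linearly with batch width."""
    if n_in_flight <= 0:
        raise ValueError("n_in_flight must be positive")
    return costs.verify_ms_per_req * n_in_flight


def per_token_ms(costs: StageCosts, n_in_flight: float) -> float:
    """Total per-token latency under the toy model."""
    return load_independent_ms(costs) + load_dependent_ms(costs, n_in_flight)


def verify_share(costs: StageCosts, n_in_flight: float) -> float:
    """Fraction of per-token latency spent in the load-dependent slice."""
    return load_dependent_ms(costs, n_in_flight) / per_token_ms(costs, n_in_flight)


# Placeholder numbers: watch the verify share grow as more requests join.
example = StageCosts(prefill_ms=8.0, draft_ms=12.0, verify_ms_per_req=5.0)
for n in (1, 8, 32):
    print(n, per_token_ms(example, n), round(verify_share(example, n), 2))
```

With these placeholder values the verify slice is a small fraction of the total when one request is in flight and the dominant term at 32, which is the qualitative behavior the paragraph describes.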
Where does N come from? The paper's second move is to stop treating batch size as a hyperparameter. Real servers don't pick a batch: requests arrive, stay in the engine for some residence time W, and by Little's Law the average count in flight is L = λ × W. That L is the effective batch size N. Plugging in the per-stage costs lets the model predict N from λ rather than asking the user to specify it. That's what turns the analysis into a curve instead of a number: the same drafter whose low-load speedup is, say, ~2.8× (illustrative) can decay toward ~1.0× at saturation, and the inflection point is computable from the model, not just discovered on a dashboard.
Figure: Little's Law, L = λ × W, is exact for any stable queue. With the service time fixed at 50 ms, W diverges as λ approaches capacity (~100 req/s), which produces the knee in the saturation chart.
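One way to see how a batch size can fall out of a request rate is to close the loop between Little's Law and a load-dependent residence time. The sketch below assumes, purely for illustration, that residence time is linear in the in-flight count, W(N) = base_s + per_req_s × N; that form, the function names, and the parameter values are hypothetical and are not the paper's model. Under that assumption, N = λ × W(N) has the closed-form solution N = λ × base_s / (1 − λ × per_req_s), which blows up as λ approaches 1 / per_req_s.

```python
def residence_time_s(n_in_flight: float, base_s: float, per_req_s: float) -> float:
    """Assumed linear residence time: fixed part plus a per-request slowdown."""
    return base_s + per_req_s * n_in_flight


def in_flight_from_rate(rate_per_s: float, base_s: float, per_req_s: float) -> float:
    """Solve N = rate * W(N) for the linear W(N) above (Little's Law fixed point)."""
    if rate_per_s < 0:
        raise ValueError("rate_per_s must be non-negative")
    headroom = 1.0 - rate_per_s * per_req_s
    if headroom <= 0:
        # At or beyond capacity there is no finite steady-state N in this toy model.
        raise ValueError("arrival rate at or beyond capacity; no stable operating point")
    return rate_per_s * base_s / headroom


# Hypothetical parameters: 50 ms base time, capacity at 1 / 0.01 = 100 req/s.
BASE_S, PER_REQ_S = 0.05, 0.01
for rate in (10, 50, 90, 99):
    n = in_flight_from_rate(rate, BASE_S, PER_REQ_S)
    w = residence_time_s(n, BASE_S, PER_REQ_S)
    print(f"rate={rate:>3} req/s  N={n:8.2f}  W={w * 1000:8.1f} ms")
```

With these placeholder values the sketch gives roughly N = 5 at 50 req/s and N = 45 at 90 req/s, so the in-flight count grows much faster than the arrival rate near capacity. It is a deterministic self-consistency calculation and ignores queueing variance.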
For practitioners, the consequence is the break-even calculation. A worked example sharpens it (numbers illustrative; the paper's actual closed form depends on drafter size, window length, kernel choice, and hardware). Suppose the load-independent slice — prefill amortized plus the drafter pass — costs 20 ms per accepted token, and the load-dependent verify slice costs 5 ms × N for a window-of-4 spec configuration. At N = 1 (idle), spec-decode pays 20 + 5 × 1 = 25 ms per accepted token, while plain decode pays its full forward pass per token at roughly 20 + 20 = 40 ms — a ~1.6× win. At N = 32 (saturated), spec pays 20 + 5 × 32 = 180 ms per token, and plain decode's same wide matmul lands at roughly the same total cost — ~1.0×, no win. The numbers are made up; the shape of the decay is the teaching point and falls out of the same algebra regardless.
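The arithmetic in the worked example can be reproduced in a few lines. The constants are the illustrative numbers from the paragraph above; the plain-decode costs are simply the two values quoted there (reading "roughly the same total cost" at N = 32 as exactly 180 ms for the sake of the arithmetic), not a fitted model, and the function names are made up for this sketch.

```python
FIXED_MS = 20.0          # load-independent slice from the worked example
VERIFY_MS_PER_REQ = 5.0  # load-dependent verify slice per in-flight request

# Plain-decode cost per token at the two load levels quoted in the text.
# The N = 32 entry reads "roughly the same total cost" as exactly 180 ms.
PLAIN_MS = {1: 40.0, 32: 180.0}


def spec_ms(n_in_flight: int) -> float:
    """Spec-decode cost per accepted token: fixed slice plus verify slice."""
    return FIXED_MS + VERIFY_MS_PER_REQ * n_in_flight


def speedup(n_in_flight: int) -> float:
    """Ratio of plain-decode cost to spec-decode cost at a quoted load level."""
    return PLAIN_MS[n_in_flight] / spec_ms(n_in_flight)


for n in sorted(PLAIN_MS):
    print(f"N={n:>2}  spec={spec_ms(n):6.1f} ms  "
          f"plain={PLAIN_MS[n]:6.1f} ms  speedup={speedup(n):.2f}x")
```

The two rows recover the ~1.6× win at idle and the ~1.0× result at saturation. Filling in the curve between them would need a plain-decode cost function, which the example does not specify.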
How it sits next to existing spec-decode analyses
| Analysis style | What it reports | Captures load decay? | Practitioner use |
|---|---|---|---|
| Single-headline benchmark | "X× at batch B" | no | marketing one-liner |
| Acceptance-length focus (Medusa, EAGLE, PPOW) | tokens accepted / window | no — orthogonal axis | drafter architecture choice |
| vLLM dashboards / TPOT plots | measured curve in deployment | yes — empirically | post-hoc capacity check |
| Load-dependent latency model (this paper) | closed-form speedup vs λ | yes — predicted from first principles | pre-deployment break-even |
The headline isn't a new draft model architecture or a kernel — it's a vocabulary. Once a team can talk about "the load-independent slice" and "the load-dependent slice" separately, the tradeoff decision for spec-decode in any given service stops being binary ("turn it on / off") and becomes a function of expected traffic shape. A workload that spends most of its time idle benefits maximally; a workload that runs hot and steady benefits hardly at all; and the inflection point — where the verify slice eats the slack — is calculable rather than discovered.
Related explainers
- PPOW — window-level RL for speculative drafters — orthogonal axis: lifts the acceptance length ceiling without changing the load-decay shape
- HuggingFace — Async continuous batching — the engine plumbing that makes the load-dependent slice cheaper to share
- PreFT — Prefill-only LoRA adapters — separates prefill cost from decode cost in the multi-LoRA serving case