HuggingFace blog — Async continuous batching
The news. On May 14, 2026, HuggingFace published *Unlocking Asynchronicity in Continuous Batching* — an engineering deep-dive on overlapping CPU-side batch preparation with GPU compute, shipped in the transformers library's generation loop. On their reference workload (an 8K-token prompt, batch size 32, 8B-parameter model), end-to-end wall-clock time fell from 300.6 s to 234.5 s — roughly a 22% speedup — with GPU-active time climbing from 76.0% to 99.4%.
Picture the restaurant line. In a synchronous kitchen the runner finishes writing the order, hands it to the cook, and stands there watching. The cook finishes plating, hands the dish to the server, and stands there watching. The server delivers, comes back, and now everyone is staring at the runner again. The kitchen has three people but at any instant the most expensive station — the cook — sits idle while one of the others does its bit. The throughput of the whole line is dictated by the slowest hand-off, not the speed of any single station.
An async kitchen runs differently. While the cook is still plating dish N, the runner is already at the next table taking the order for dish N+1; while the server is delivering, the cook starts prepping the next ticket. The same three people, the same shift, but the cook — the bottleneck — almost never stops working.
Continuous batching has the exact same anatomy. Each iteration is a tiny schedule → prefill → decode cycle. In sync mode, the GPU finishes its forward pass and then waits while Python loops over completed requests, decides which new ones to admit, allocates KV-cache blocks, packs the next batch tensor, and finally re-enters CUDA-land to issue the next kernels. HuggingFace measured that gap directly on their reference workload: the GPU active time bar sits at 76.0% of the wall clock — the remaining quarter is the GPU sitting idle while the CPU prepares the next batch.
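In code, the sync loop has roughly this shape — a minimal sketch with hypothetical `scheduler`, `kv_cache`, `model`, and `sample` objects, not the actual transformers implementation:

```python
# Minimal sketch of a *synchronous* continuous-batching loop.
# `scheduler`, `kv_cache`, `model`, and `sample` are hypothetical stand-ins,
# not the transformers API.
import torch

def generate_sync(scheduler, kv_cache, model, sample):
    while scheduler.has_work():
        # --- CPU-side batch prep: the GPU is idle for all of this ---
        finished = scheduler.pop_finished()     # retire completed sequences
        kv_cache.free(finished)                 # release their paged KV blocks
        kv_cache.allocate(scheduler.admit())    # admit waiting requests, reserve blocks
        batch = scheduler.build_batch()         # pack input ids / positions / block tables

        # --- GPU-side work: the CPU sits and waits for the result ---
        inputs = batch.to("cuda")               # blocking host-to-device copy
        logits = model.forward_step(inputs, kv_cache)
        tokens = sample(logits).cpu()           # blocking device-to-host copy + sync
        scheduler.record_tokens(tokens)         # back to CPU land; repeat
```

Everything above the `.to("cuda")` call is pure Python on the host while the GPU waits; everything below it leaves the Python thread waiting on the device.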
The async pattern overlaps the two by decoupling Python's scheduling decisions from GPU kernel execution. Concretely: while the GPU is running iteration N's kernels, the Python scheduler is already composing iteration N+1 on the CPU thread — sampling completed tokens from N's outputs, evicting finished sequences, choosing new admissions, allocating fresh paged-attention blocks, and queueing the next launch. Within the GPU side, HuggingFace also splits the host-to-device input copy, the compute kernels, and the device-to-host output copy onto separate CUDA streams synchronized with CUDA events, so a single iteration's data movement and math overlap too. By the time iteration N's kernels return, iteration N+1's batch tensor is already on the device and the next kernels are already queued. The bubble closes.
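A rough sketch of that overlap using PyTorch streams and events. The `torch.cuda.Stream` / `torch.cuda.Event` calls are real PyTorch primitives; the `scheduler` and `model` objects and the hand-off between iterations are simplified stand-ins, not the transformers internals:

```python
# Illustrative overlap: three CUDA streams per iteration, plus cross-iteration
# pipelining of CPU batch prep against in-flight GPU work.
# `model` and `scheduler` are hypothetical; the stream/event calls are real PyTorch APIs.
import torch

h2d, compute, d2h = torch.cuda.Stream(), torch.cuda.Stream(), torch.cuda.Stream()

def launch_iteration(model, batch_cpu):
    """Queue one iteration's copies and kernels without blocking the Python thread."""
    inputs_ready = torch.cuda.Event()
    outputs_ready = torch.cuda.Event()

    with torch.cuda.stream(h2d):                        # host-to-device input copy
        inputs = batch_cpu.pin_memory().to("cuda", non_blocking=True)
        inputs_ready.record()

    with torch.cuda.stream(compute):                    # forward-pass kernels
        compute.wait_event(inputs_ready)                # start only once inputs have landed
        logits = model.forward_step(inputs)

    with torch.cuda.stream(d2h):                        # device-to-host output copy
        d2h.wait_stream(compute)
        host_logits = torch.empty(logits.shape, dtype=logits.dtype,
                                  device="cpu", pin_memory=True)
        host_logits.copy_(logits, non_blocking=True)    # async copy into pinned host memory
        outputs_ready.record()

    return host_logits, outputs_ready

def generate_async(scheduler, model):
    """While the GPU chews on iteration N, the Python thread builds iteration N+1.
    Simplified: in the real pipeline the scheduler also folds iteration N's sampled
    tokens into N+1 as soon as the output copy lands."""
    in_flight = None
    while scheduler.has_work():
        next_batch = scheduler.build_batch()            # pure CPU work, overlaps GPU kernels
        if in_flight is not None:
            host_logits, ready = in_flight
            ready.synchronize()                         # block only when N's results are needed
            scheduler.ingest(host_logits)               # sample tokens, retire finished sequences
        in_flight = launch_iteration(model, next_batch)
    if in_flight is not None:                           # drain the last queued iteration
        host_logits, ready = in_flight
        ready.synchronize()
        scheduler.ingest(host_logits)
```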
The mechanism has a wrinkle the post is careful about. CUDA graphs (a common decode-loop optimization that captures the kernel sequence once and replays it many times) hold a shared memory pool across iterations — so within a single captured graph, batch N's buffers cannot be reused until batch N finishes. The HF implementation therefore runs the batch-prep ↔ compute overlap across iteration boundaries while letting each captured graph complete before the next one starts. The async win is between iterations, not inside a single graph capture.
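For context, here is the standard PyTorch capture/replay pattern (a generic sketch, not the HF code): a captured graph always reads and writes the same static tensors out of its private memory pool, which is exactly why those buffers cannot be recycled for the next batch until the replayed work has finished.

```python
# Generic CUDA-graph capture/replay sketch (PyTorch), illustrating why buffers
# inside a captured graph are tied to that graph's memory pool.
import torch

model = torch.nn.Linear(4096, 4096).cuda()              # stand-in for the decode step
static_in = torch.zeros(32, 4096, device="cuda")        # fixed address the graph will read

# PyTorch recommends warming up on a side stream before capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    _ = model(static_in)
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):                            # capture kernels + their buffer addresses
    static_out = model(static_in)

# Replay: new data must be copied into the *same* static input buffer; the graph's
# internal allocations live in a shared pool that is only safe to reuse once the
# replayed work has completed on the GPU.
static_in.copy_(torch.randn(32, 4096, device="cuda"))
graph.replay()
torch.cuda.synchronize()
print(static_out.shape)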
The hero animation shows the same idea with six iteration cycles packed into one wall-clock window. In the sync layout (PROBLEM beat), each cycle alternates between CPU prep and GPU work, with an amber GPU-idle band marking the wait. In the async layout (SOLUTION beat), the CPU lane and the GPU lanes are both busy at the same wall-clock moment — different iteration indices, overlapping in time. The GPU-active bar climbs from 76% to ~99% — the same hardware, a different orchestration.
A worked numeric example (illustrative, derived from HF's reported ratios — exact per-iteration timings are setup-dependent). Suppose one iteration breaks down as: CPU batch prep ~10 ms, GPU forward pass ~32 ms. Sync wall-clock per iteration: 10 + 32 = **42 ms**, of which the GPU is active for 32 ms — 32 / 42 ≈ **76%** GPU-active, matching HF's measured sync number. Async overlap collapses the iteration to max(10, 32) = **32 ms**, with the GPU active essentially the full 32 ms — close to 99% active time and roughly 1.31× as many iterations per second on the same hardware (illustrative). HF's end-to-end 22% speedup sits below that ceiling because the first and last iterations of a run still pay an unhidden CPU cost, and other small fixed costs do not move with overlap.
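The same arithmetic as a short sanity check (all numbers illustrative, as above):

```python
# Back-of-the-envelope check of the illustrative per-iteration numbers above.
cpu_prep_ms, gpu_forward_ms = 10.0, 32.0          # assumed split, not measured values

sync_ms  = cpu_prep_ms + gpu_forward_ms           # 42 ms: prep then compute, serialized
async_ms = max(cpu_prep_ms, gpu_forward_ms)       # 32 ms: prep hidden behind the kernels

print(f"sync  GPU-active: {gpu_forward_ms / sync_ms:.1%}")   # ~76.2%
print(f"async GPU-active: {gpu_forward_ms / async_ms:.1%}")  # 100.0% in the idealized model
print(f"iteration-rate ceiling: {sync_ms / async_ms:.2f}x")  # ~1.31x
```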
How it sits next to the existing landscape
| Approach | How CPU and GPU relate | GPU bubble between iterations | Reported impact (HF reference workload) |
|---|---|---|---|
| Static batching | One batch at a time, all requests run to completion together | n/a — new requests wait until the whole batch finishes | baseline, easy to reason about, low utilisation (setup-dependent) |
| Continuous batching (sync) | New requests admitted between iterations; CPU prep blocks GPU compute | CPU batch-prep gap each iteration (setup-dependent) | 76.0% GPU-active on 8K-token / batch-32 / 8B-model |
| Continuous batching (async overlap) | CPU batch-prep concurrent with GPU compute; H2D / compute / D2H on separate CUDA streams | ≈ 0 — Python runs concurrent with kernels | 99.4% GPU-active, ~22% end-to-end speedup on the same workload |
| Prefill/decode disaggregation | Prefill and decode on separate GPU pools | cross-machine KV transfer, not CPU idle | different goal — tail-latency at high concurrency, complementary lever |
Async continuous batching is a plumbing change, not a model change. The weights, the kernels, and the KV-cache layout are unchanged — what changes is when Python runs relative to the GPU. That makes it free in terms of accuracy and additive to other serving optimisations: CUDA Graphs still capture the decode hot path, prefix caching still reuses KV blocks across requests, paged attention still keeps the cache fragmented-but-packed.
The headline isn't a new algorithm — it's a scheduler architecture that closes the CPU↔GPU bubble. The price is engineering complexity: the Python scheduler now has to reason about state from two iterations at once (the one being prepared, the one running on the GPU), CUDA-event synchronization replaces implicit ordering, and every error path — cancellation, OOM, finished sequences — has to handle the pipelined state cleanly.
Goes deeper in: LLM Serving → Inference Engine → Continuous Batching
Related explainers
- vLLM v0.20 — FlashAttention 4 packing — a kernel-level win on the same prefill/decode path
- vLLM v0.20 — TurboQuant 2-bit KV cache — squeezing more concurrent requests into the same HBM
- AsyncFC — symbolic futures in the decode stream — a different async pattern, this one for tool calls inside the decode loop