HuggingFace blog — Async continuous batching
The news. On May 14, 2026, HuggingFace published *Unlocking Asynchronicity in Continuous Batching* — an engineering deep-dive on overlapping CPU-side batch preparation with GPU compute, shipped in the transformers library's generation loop. On their reference workload (an 8K-token prompt, batch size 32, 8B-parameter model), end-to-end wall-clock time fell from 300.6 s to 234.5 s — roughly a 22% speedup — with GPU-active time climbing from 76.0% to 99.4%.
Picture the restaurant line. In a synchronous kitchen the runner finishes writing the order, hands it to the cook, and stands there watching. The cook finishes plating, hands the dish to the server, and stands there watching. The server delivers, comes back, and now everyone is staring at the runner again. The kitchen has three people but at any instant the most expensive station — the cook — sits idle while one of the others does its bit. The throughput of the whole line is dictated by the slowest hand-off, not the speed of any single station.
An async kitchen runs differently. While the cook is still plating dish N, the runner is already at the next table taking the order for dish N+1; while the server is delivering, the cook starts prepping the next ticket. The same three people, the same shift, but the cook — the bottleneck — almost never stops working.
Continuous batching has the exact same anatomy. Each iteration is a tiny schedule → prefill → decode cycle. In sync mode, the GPU finishes its forward pass and then waits while Python loops over completed requests, decides which new ones to admit, allocates KV-cache blocks, packs the next batch tensor, and finally re-enters CUDA-land to issue the next kernels. HuggingFace measured that gap directly on their reference workload: the GPU active time bar sits at 76.0% of the wall clock — the remaining quarter is the GPU sitting idle while the CPU prepares the next batch.
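In code, the sync loop has roughly this shape — a minimal sketch with hypothetical `scheduler`, `kv_cache`, `model`, and `sample` objects, not the actual transformers implementation:

```python
# Minimal sketch of a *synchronous* continuous-batching loop.
# `scheduler`, `kv_cache`, `model`, and `sample` are hypothetical stand-ins,
# not the transformers API.
import torch

def generate_sync(scheduler, kv_cache, model, sample):
    while scheduler.has_work():
        # --- CPU-side batch prep: the GPU is idle for all of this ---
        finished = scheduler.pop_finished()     # retire completed sequences
        kv_cache.free(finished)                 # release their paged KV blocks
        kv_cache.allocate(scheduler.admit())    # admit waiting requests, reserve blocks
        batch = scheduler.build_batch()         # pack input ids / positions / block tables

        # --- GPU-side work: the CPU sits and waits for the result ---
        inputs = batch.to("cuda")               # blocking host-to-device copy
        logits = model.forward_step(inputs, kv_cache)
        tokens = sample(logits).cpu()           # blocking device-to-host copy + sync
        scheduler.record_tokens(tokens)         # back to CPU land; repeat
```

Everything above the `.to("cuda")` call is pure Python on the host while the GPU waits; everything below it leaves the Python thread waiting on the device.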
The async pattern overlaps the two by decoupling Python's scheduling decisions from GPU kernel execution. Concretely: while the GPU is running iteration N's kernels, the Python scheduler is already composing iteration N+1 on the CPU thread — sampling completed tokens from N's outputs, evicting finished sequences, choosing new admissions, allocating fresh paged-attention blocks, and queueing the next launch. Within the GPU side, HuggingFace also splits the host-to-device input copy, the compute kernels, and the device-to-host output copy onto separate CUDA streams synchronized with CUDA events, so a single iteration's data movement and math overlap too. By the time iteration N's kernels return, iteration N+1's batch tensor is already on the device and the next kernels are already queued. The bubble closes.
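A rough sketch of that overlap using PyTorch streams and events. The `torch.cuda.Stream` / `torch.cuda.Event` calls are real PyTorch primitives; the `scheduler` and `model` objects and the hand-off between iterations are simplified stand-ins, not the transformers internals:

```python
# Illustrative overlap: three CUDA streams per iteration, plus cross-iteration
# pipelining of CPU batch prep against in-flight GPU work.
# `model` and `scheduler` are hypothetical; the stream/event calls are real PyTorch APIs.
import torch

h2d, compute, d2h = torch.cuda.Stream(), torch.cuda.Stream(), torch.cuda.Stream()

def launch_iteration(model, batch_cpu):
    """Queue one iteration's copies and kernels without blocking the Python thread."""
    inputs_ready = torch.cuda.Event()
    outputs_ready = torch.cuda.Event()

    with torch.cuda.stream(h2d):                        # host-to-device input copy
        inputs = batch_cpu.pin_memory().to("cuda", non_blocking=True)
        inputs_ready.record()

    with torch.cuda.stream(compute):                    # forward-pass kernels
        compute.wait_event(inputs_ready)                # start only once inputs have landed
        logits = model.forward_step(inputs)

    with torch.cuda.stream(d2h):                        # device-to-host output copy
        d2h.wait_stream(compute)
        host_logits = torch.empty(logits.shape, dtype=logits.dtype,
                                  device="cpu", pin_memory=True)
        host_logits.copy_(logits, non_blocking=True)    # async copy into pinned host memory
        outputs_ready.record()

    return host_logits, outputs_ready

def generate_async(scheduler, model):
    """While the GPU chews on iteration N, the Python thread builds iteration N+1.
    Simplified: in the real pipeline the scheduler also folds iteration N's sampled
    tokens into N+1 as soon as the output copy lands."""
    in_flight = None
    while scheduler.has_work():
        next_batch = scheduler.build_batch()            # pure CPU work, overlaps GPU kernels
        if in_flight is not None:
            host_logits, ready = in_flight
            ready.synchronize()                         # block only when N's results are needed
            scheduler.ingest(host_logits)               # sample tokens, retire finished sequences
        in_flight = launch_iteration(model, next_batch)
    if in_flight is not None:                           # drain the last queued iteration
        host_logits, ready = in_flight
        ready.synchronize()
        scheduler.ingest(host_logits)
```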
The mechanism has a wrinkle the post is careful about. CUDA graphs (a common decode-loop optimization that captures the kernel sequence once and replays it many times) hold a shared memory pool across iterations — so within a single captured graph, batch N's buffers cannot be reused until batch N finishes. The HF implementation therefore runs the batch-prep ↔ compute overlap across iteration boundaries while letting each captured graph complete before the next one starts. The async win is between iterations, not inside a single graph capture.
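For context, here is the standard PyTorch capture/replay pattern (a generic sketch, not the HF code): a captured graph always reads and writes the same static tensors out of its private memory pool, which is exactly why those buffers cannot be recycled for the next batch until the replayed work has finished.

```python
# Generic CUDA-graph capture/replay sketch (PyTorch), illustrating why buffers
# inside a captured graph are tied to that graph's memory pool.
import torch

model = torch.nn.Linear(4096, 4096).cuda()              # stand-in for the decode step
static_in = torch.zeros(32, 4096, device="cuda")        # fixed address the graph will read

# PyTorch recommends warming up on a side stream before capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    _ = model(static_in)
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):                            # capture kernels + their buffer addresses
    static_out = model(static_in)

# Replay: new data must be copied into the *same* static input buffer; the graph's
# internal allocations live in a shared pool that is only safe to reuse once the
# replayed work has completed on the GPU.
static_in.copy_(torch.randn(32, 4096, device="cuda"))
graph.replay()
torch.cuda.synchronize()
print(static_out.shape)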
The hero animation shows the same idea with six iteration cycles packed into one wall-clock window. In the sync layout (PROBLEM beat), each cycle alternates between CPU prep and GPU work, with an amber GPU-idle band marking the wait. In the async layout (SOLUTION beat), the CPU lane and the GPU lanes are both busy at the same wall-clock moment — different iteration indices, overlapping in time. The GPU-active bar climbs from 76% to ~99% — the same hardware, a different orchestration.
A worked numeric example (illustrative, derived from HF's reported ratios — exact per-iteration timings are setup-dependent). Suppose one iteration breaks down as: CPU batch prep ~10 ms, GPU forward pass ~32 ms. Sync wall-clock per iteration: 10 + 32 = **42 ms**, of which the GPU is active for 32 ms — 32 / 42 ≈ **76%** GPU-active, matching HF's measured sync number. Async overlap collapses the iteration to max(10, 32) = **32 ms**, with the GPU active essentially the full 32 ms — close to 99% active time and roughly 1.31× as many iterations per second on the same hardware (illustrative). HF's end-to-end 22% speedup sits below that ceiling because the first and last iterations of a run still pay an unhidden CPU cost, and other small fixed costs do not move with overlap.
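The same arithmetic as a short sanity check (all numbers illustrative, as above):

```python
# Back-of-the-envelope check of the illustrative per-iteration numbers above.
cpu_prep_ms, gpu_forward_ms = 10.0, 32.0          # assumed split, not measured values

sync_ms  = cpu_prep_ms + gpu_forward_ms           # 42 ms: prep then compute, serialized
async_ms = max(cpu_prep_ms, gpu_forward_ms)       # 32 ms: prep hidden behind the kernels

print(f"sync  GPU-active: {gpu_forward_ms / sync_ms:.1%}")   # ~76.2%
print(f"async GPU-active: {gpu_forward_ms / async_ms:.1%}")  # 100.0% in the idealized model
print(f"iteration-rate ceiling: {sync_ms / async_ms:.2f}x")  # ~1.31x
```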
How it sits next to the existing landscape
| Approach | How CPU and GPU relate | GPU bubble between iterations | Reported impact (HF reference workload) |
|---|---|---|---|
| Static batching | One batch at a time, all requests run to completion together | n/a — new requests wait until the whole batch finishes | baseline, easy to reason about, low utilisation (setup-dependent) |
| Continuous batching (sync) | New requests admitted between iterations; CPU prep blocks GPU compute | CPU batch-prep gap each iteration (setup-dependent) | 76.0% GPU-active on 8K-token / batch-32 / 8B-model |
| Continuous batching (async overlap) | CPU batch-prep concurrent with GPU compute; H2D / compute / D2H on separate CUDA streams | ≈ 0 — Python runs concurrent with kernels | 99.4% GPU-active, ~22% end-to-end speedup on the same workload |
| Prefill/decode disaggregation | Prefill and decode on separate GPU pools | cross-machine KV transfer, not CPU idle | different goal — tail-latency at high concurrency, complementary lever |
Async continuous batching is a plumbing change, not a model change. The weights, the kernels, and the KV-cache layout are unchanged — what changes is when Python runs relative to the GPU. That makes it free in terms of accuracy and additive to other serving optimisations: CUDA Graphs still capture the decode hot path, prefix caching still reuses KV blocks across requests, paged attention still keeps the cache fragmented-but-packed.
The headline isn't a new algorithm — it's a scheduler architecture that closes the CPU↔GPU bubble. The price is engineering complexity: the Python scheduler now has to reason about state from two iterations at once (the one being prepared, the one running on the GPU), CUDA-event synchronization replaces implicit ordering, and every error path — cancellation, OOM, finished sequences — has to handle the pipelined state cleanly.
Goes deeper in: LLM Serving → Inference Engine → Continuous Batching
Related explainers
- vLLM v0.20 — FlashAttention 4 packing — a kernel-level win on the same prefill/decode path
- vLLM v0.20 — TurboQuant 2-bit KV cache — squeezing more concurrent requests into the same HBM
- AsyncFC — symbolic futures in the decode stream — a different async pattern, this one for tool calls inside the decode loop