LLM Batching — Static vs Continuous Batching
What is LLM Batching?
GPUs are massively parallel processors, but serving one user at a time wastes most of that capacity. Batching groups multiple requests together so the GPU processes them simultaneously, multiplying throughput. Static batching waits for all requests in a batch to finish before starting new ones — simple but wasteful when requests have different lengths. Continuous batching (used by vLLM, TGI) inserts new requests as soon as a slot opens, keeping the GPU busy. The tradeoff is memory: each request in the batch needs its own KV cache, so batch size is limited by available GPU memory.
Why GPUs Need Batching
GPUs are not like CPUs. A modern CPU has 8–32 cores optimized for sequential, low-latency work. A modern GPU has thousands of cores designed to do the same operation on many pieces of data at once — a style called SIMD (Single Instruction, Multiple Data).
That architectural choice pays off enormously for matrix multiplications — the core operation in a transformer. But it creates a catch: you need enough work to fill all those cores at once. One request, no matter how complex, won't do it.
The Factory Analogy
Imagine a factory floor with 10,000 workers. You hire them all to assemble a product. If you only give them one order at a time, 9,999 workers stand idle while one assembles. The factory is "running" but operating at 0.01% capacity.
Batching is giving the factory 10,000 orders at once — all workers stay busy, throughput skyrockets, and the cost per unit drops dramatically.
Memory-Bound vs Compute-Bound
There's a more technical reason batching matters. GPU work falls into two categories:
- Memory-bound: The GPU is waiting for data to arrive from memory. Cores sit idle.
- Compute-bound: The GPU is busy doing math. Cores are fully utilized.
With a single request, the weight matrices for each transformer layer need to be loaded from GPU memory for every token step. That's a lot of data movement — and with only one request using the result, you're paying the memory bandwidth cost for very little compute. The GPU is memory-bound.
Add more requests to the batch, and the same weight matrices get used to compute outputs for all of them simultaneously. Now you're doing much more compute per memory load. The GPU becomes compute-bound — which is where it's most efficient.
Kipply's "Transformer Inference Arithmetic" identifies roughly 208 tokens in flight as the threshold where inference on an A100 transitions from memory-bound to compute-bound (the number is simply the ratio of the GPU's FP16 compute throughput to its memory bandwidth). Below that, you're wasting capacity.
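As a back-of-envelope sketch of where that number comes from, assuming an A100's roughly 312 TFLOPS of FP16 compute and 1.5 TB/s of memory bandwidth (other GPUs have different ratios):

# Where the ~208-token threshold comes from (rough A100 figures, assumed)
compute = 312e12       # FP16 tensor-core FLOPs per second
bandwidth = 1.5e12     # HBM memory bandwidth in bytes per second

# Each FP16 weight is 2 bytes and costs 2 FLOPs (multiply + add) per token
# in flight, so bytes and FLOPs per parameter cancel out: the break-even
# batch is simply the hardware's FLOPs-to-bytes ratio.
tokens_to_saturate = compute / bandwidth
print(f"~{tokens_to_saturate:.0f} tokens in flight to become compute-bound")   # ~208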
Batching doesn't make your individual request faster — it makes the GPU serve more requests per second. Throughput goes up; latency per request stays roughly the same or increases slightly. The goal is cost efficiency, not speed.
The right panel shows the batching simulator. You'll use it in the next few steps to see exactly how different batching strategies affect GPU utilization.
Static Batching: The Blocking Problem
The simplest way to batch requests is static batching: collect N requests, run them all together as one batch, and wait until every single request in the batch finishes before accepting new ones.
How It Works
- Collect a fixed batch of N requests (e.g., N=3: R0, R1, R2)
- Run all N requests in parallel, one token step at a time
- Wait until the longest request finishes
- Only then: release all slots and start the next batch
This works, but it has an obvious flaw.
The Idle Slot Problem
Requests have wildly different lengths. A simple "What's 2+2?" might complete in 5 tokens. A detailed "Explain quantum computing" might need 300 tokens.
In a static batch:
- R0 finishes at token 20 — its slot sits idle
- R1 finishes at token 20 — its slot sits idle
- R2 needs 200 tokens — the batch can't end until it's done
For 180 token steps, two out of three GPU slots are doing nothing. The GPU is running at 33% utilization while paying 100% of the memory bandwidth cost.
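To put numbers on the waste, here's a tiny illustrative calculation using the lengths from this example:

# Wasted capacity in a static batch (numbers from the example above)
lengths = [20, 20, 200]                  # tokens generated by R0, R1, R2
duration = max(lengths)                  # the batch holds all slots for 200 steps
useful = sum(lengths)                    # 240 slot-steps of real work
total = len(lengths) * duration          # 600 slot-steps reserved

print(f"overall utilization: {useful / total:.0%}")                           # 40%
print(f"after step 20: {1 / len(lengths):.0%} of slots busy for 180 steps")   # 33%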
Try It
In the right panel, the Static Batching side shows the batch in progress. Watch what happens to R0 (short, 2 steps) after it finishes — its progress bar stalls while R1 (long, 6 steps) is still running. The GPU utilization bar drops as more requests complete.
The GPU utilization metric at the bottom shows the real cost: the more length variation in a batch, the more wasted capacity.
Static batching was how most LLM serving worked from 2020 to 2022 — including early versions of HuggingFace's text generation pipelines. It's simple to implement, but it leaves significant GPU capacity on the table every time there's a length mismatch in the batch.
The fix turns out to be conceptually simple: instead of waiting for the whole batch to finish, check after every token step whether any slot has freed up — and if so, immediately fill it with a new request.
Continuous Batching
Continuous batching — also called iteration-level scheduling — solves the idle slot problem by making a simple change: check after every single token step whether any request has finished, and if so, immediately admit a new one from the queue.
The Key Insight
Static batching treats the batch as the unit of work. Continuous batching treats the token step as the unit of work.
# Static batching (simplified pseudocode)
def run_batch(requests):
    while not all_done(requests):
        run_one_step(requests)                   # all requests, together
    return [r.result for r in requests]          # only return when ALL are done

# Continuous batching
def run_continuous(queue, max_batch_size=3):
    active = []
    while queue and len(active) < max_batch_size:
        active.append(queue.pop(0))              # fill the batch to start (FIFO)
    while active:
        run_one_step(active)                     # step all active requests
        for r in [r for r in active if r.is_done()]:
            active.remove(r)
            yield r.result                       # return its result immediately
            if queue:
                active.append(queue.pop(0))      # fill the freed slot right away
The loop runs once per token step. Done requests leave, new ones enter. The GPU is always at full capacity.
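If you want to execute the two loops above, here's a minimal toy harness; Request, all_done, and run_one_step are stand-ins invented for this sketch, not real serving code:

# Toy stand-ins so the loops above actually run; nothing here touches a real
# model, it just pretends to decode one token per step
class Request:
    def __init__(self, name, length):
        self.name, self.length, self.generated = name, length, 0
    def is_done(self):
        return self.generated >= self.length
    @property
    def result(self):
        return f"{self.name}: {self.generated} tokens"

def all_done(requests):
    return all(r.is_done() for r in requests)

def run_one_step(requests):
    for r in requests:
        if not r.is_done():
            r.generated += 1

queue = [Request("R0", 2), Request("R1", 6), Request("R2", 3), Request("R3", 4)]
for result in run_continuous(queue, max_batch_size=2):
    print(result)    # prints R0, R2, R1, R3: short requests finish first and free their slots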
The Numbers
The Orca paper (Yu et al., 2022) introduced iteration-level scheduling. Combined with vLLM's PagedAttention (which we'll cover in the next module), this approach delivers roughly 23× higher throughput than naive static batching — without any changes to the model itself.
vLLM's 2023 blog reported 24× higher throughput than HuggingFace transformers on LLaMA and OPT models, with the same GPU hardware.
Try It
In the right panel, compare the Static and Continuous utilization bars side by side. Step through time using the controls:
- In Static: watch R0 finish early and its slot go idle
- In Continuous: when R0 finishes, R3 immediately fills the slot — utilization stays high throughout
The throughput summary at the bottom shows real-time utilization for both approaches at each time step.
Continuous batching is now the universal default. Every major LLM serving framework uses it: vLLM, HuggingFace TGI, SGLang, TensorRT-LLM. If you're deploying a model today, you're almost certainly using continuous batching whether you know it or not.
But continuous batching at the token level introduces a new scheduling challenge: some requests are in their prefill phase (processing the prompt) while others are in their decode phase (generating tokens). These two phases have very different compute characteristics — and they compete for the same GPU resources.
Prefill vs Decode Scheduling
A Quick Recap
From the KV Cache module: every request goes through two distinct phases.
Prefill — the model processes your entire prompt in one parallel pass. All tokens are handled simultaneously. This is compute-intensive: the GPU does a lot of math at once, similar to training.
Decode — the model generates one new token at a time, sequentially. Each step attends to all previous tokens via the KV cache. This is memory-intensive: the GPU is mainly reading cached key/value tensors rather than doing heavy computation.
These two phases have fundamentally different performance characteristics. Prefill is compute-bound — it benefits from a large batch of prompt tokens processed together. Decode is memory-bound — it benefits from many requests being decoded in parallel so the memory bandwidth cost is shared.
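One way to see the difference is arithmetic intensity: FLOPs of useful work per byte of weights read. The numbers below are rough assumptions for a 7B model in FP16, meant only to show the gap between the two phases:

# Rough arithmetic intensity of prefill vs decode for an assumed 7B model
params = 7e9
weight_bytes = params * 2          # FP16: every weight is read once per forward pass
flops_per_token = 2 * params       # one multiply + one add per parameter per token

prompt_len = 4_000                 # prefill processes the whole prompt in one pass
prefill = prompt_len * flops_per_token / weight_bytes    # 4000 FLOPs per byte
decode = 1 * flops_per_token / weight_bytes              # 1 FLOP per byte

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# Decode pays the same memory traffic for 1/4000th of the math, which is why it
# needs many concurrent requests in the batch to keep the GPU busy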
The Problem: Prefill Blocks Decode
When a new request arrives with a long prompt — say, a 4,000-token PDF — the serving system must run prefill for those 4,000 tokens before it can generate the first response token.
During that prefill, the GPU is fully occupied. Every other request in the batch — including ones already in their decode phase, where users are waiting for the next token — must pause and wait.
For a user whose request is mid-decode, this shows up as a noticeable stall: their stream of tokens stops, then resumes. The longer the new arrival's prompt, the longer everyone else waits.
Chunked Prefill: Interleaving the Work
Chunked prefill breaks a long prompt into fixed-size chunks — typically 512 tokens — and interleaves them with decode steps for other requests.
Instead of: "Run all 4,000 prefill tokens, then resume decoding for everyone"
You get: "Run 512 prefill tokens, run one decode step for everyone, run next 512 prefill tokens, decode again, ..."
This keeps decode latency smooth even when long-context requests arrive. Each decode step is delayed by one chunk's worth of prefill work — a small, bounded cost — rather than the entire prompt length.
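In scheduling terms, the interleaving looks roughly like this. The snippet only simulates the ordering of work; real frameworks typically fuse the prefill chunk and the decode tokens into a single forward pass:

# Simulated schedule: a 4,000-token prompt arrives while other requests are decoding
CHUNK = 512
prompt_remaining = 4_000
schedule = []

while prompt_remaining > 0:
    schedule.append(f"prefill {min(CHUNK, prompt_remaining)} prompt tokens")
    schedule.append("decode 1 token for every in-flight request")
    prompt_remaining -= CHUNK

print(schedule[:4])
# ['prefill 512 prompt tokens', 'decode 1 token for every in-flight request',
#  'prefill 512 prompt tokens', 'decode 1 token for every in-flight request']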
Without chunked prefill, uploading a long PDF or pasting a large codeblock into a chat interface would cause visible stalls for every other user on the same server at that moment. Chunked prefill is why production systems can handle mixed workloads — short chat turns and long document analysis — without one type degrading the other.
Chunked prefill is supported in vLLM (introduced as an opt-in option around v0.4.0 and enabled by default in later releases) and SGLang. It's a concrete example of how scheduler design — not just model architecture — determines real-world serving quality.
Memory Limits the Batch
We've seen that continuous batching keeps GPU utilization high by filling slots immediately. But there's a hard physical constraint on how large a batch can actually be: GPU memory.
The KV Cache Memory Cost
Recall from the KV Cache module: every active request stores a KV cache — the key and value tensors for every attention head at every layer, for every token processed so far. This cache grows with context length.
A rough estimate for a 7B-parameter model (32 layers, 32 heads, 128 head dimension, FP16):
- Per token: 2 tensors (K and V) × 32 layers × 32 heads × 128 dims × 2 bytes = 524,288 bytes ≈ 512 KB per token
- 512-token context: ~256 MB per request
- 4,096-token context: ~2 GB per request
- 32,768-token context: ~16 GB per request
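The same estimate in a few lines of Python (the layer/head/dimension numbers are the assumed 7B configuration above):

# KV cache footprint for the assumed 7B configuration (32 layers, 32 heads, d=128, FP16)
layers, heads, head_dim, bytes_per_value = 32, 32, 128, 2
kv_tensors = 2                                    # one K and one V tensor per layer

per_token = kv_tensors * layers * heads * head_dim * bytes_per_value   # 524,288 B ≈ 512 KB

for context in (512, 4_096, 32_768):
    print(f"{context:>6} tokens -> {context * per_token / 2**30:.2f} GiB per request")
# 512 -> 0.25 GiB, 4096 -> 2.00 GiB, 32768 -> 16.00 GiB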
How Context Length Shrinks Your Batch
An A100 80GB GPU has roughly 40 GB available for KV cache after loading model weights (~14 GB for a 7B model in FP16, plus overhead).
At 512 KB/token:
| Context length | KV cache per request | Max batch size (40 GB) |
|---|---|---|
| 512 tokens | ~256 MB | ~152 requests |
| 4,096 tokens | ~2 GB | ~19 requests |
| 32,768 tokens | ~16 GB | ~2 requests |
| 128,000 tokens | ~64 GB | 0 (doesn't fit) |
The diagram above illustrates the tradeoff: the same GPU memory budget can accommodate many short-context requests or very few long-context ones.
Why This Matters for Throughput
From Kipply's analysis, you need roughly 208 tokens in flight simultaneously to keep the GPU compute-bound while serving a 7B model. At short context lengths, you have plenty of batch slots to achieve this. At 32K context, you're down to 2 concurrent requests — which may be below the compute-bound threshold, meaning the GPU is memory-bound again for a different reason: not enough parallelism.
This is why long-context models are expensive to serve. It's not that the model itself is slower — it's that the KV cache for a 128K-token context fills most of the GPU's memory, leaving room for at most one or two concurrent users.
When you hear that GPT-4's 128K context window is expensive, this is a big part of why. Serving a single 128K-context request ties up an entire GPU's memory budget. To serve it efficiently, providers use techniques like KV cache quantization (from the Quantization module) and paged memory management — which is exactly what the next module, Paged Attention, covers.
This memory constraint is also why the next-generation serving technique — PagedAttention — matters so much: it lets you manage KV cache memory more efficiently, fitting more concurrent requests into the same physical memory.
Batching in Production
Understanding batching theory is one thing. Seeing how it plays out at scale clarifies why these engineering choices matter so much.
Throughput vs Latency: The Core Tradeoff
Larger batches = higher throughput, but potentially higher latency per request. Every request waits slightly longer to be scheduled because the system is trying to accumulate a full batch.
Throughput (requests per second) and latency (time to first token, time to complete) pull in opposite directions:
- Aggressive batching: maximize GPU utilization, higher average latency
- Conservative batching: lower latency, underutilized GPU, higher cost per request
Production systems tune this based on SLAs. A real-time chat interface prioritizes time-to-first-token (TTFT). A batch summarization pipeline prioritizes throughput and cost.
Priority and Preemption
Not all requests are equal. A paid subscriber's request might need to preempt a free-tier request. A system under load might need to prioritize short requests to keep median latency low.
Modern serving frameworks support:
- Request priority levels — high-priority requests jump the queue
- Preemption — a running request can be paused and its KV cache evicted to free memory for a higher-priority request (then resumed later via recomputation or swapping to CPU memory)
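Here's a toy sketch of how priority and preemption might fit together in a scheduler. The queue layout and policy are invented for illustration and are not any framework's actual implementation:

# Toy priority scheduler with preemption (invented for illustration).
# Lower number = higher priority; each running request holds a batch slot + KV cache.
import heapq

MAX_SLOTS = 2
waiting = [(1, "free-tier-A"), (1, "free-tier-B"), (0, "paid-C")]
heapq.heapify(waiting)
running = []

def schedule_step():
    # admit the highest-priority waiting requests into any free slots
    while waiting and len(running) < MAX_SLOTS:
        running.append(heapq.heappop(waiting))
    # preempt: if a waiting request outranks the worst running one, evict that
    # request (freeing its KV cache) and requeue it to resume later
    while waiting and running and waiting[0][0] < max(p for p, _ in running):
        victim = max(running, key=lambda r: r[0])
        running.remove(victim)
        heapq.heappush(waiting, victim)
        running.append(heapq.heappop(waiting))

schedule_step()
print(running)                      # [(0, 'paid-C'), (1, 'free-tier-A')]

heapq.heappush(waiting, (0, "paid-D"))   # a paid request arrives while both slots are full
schedule_step()
print(running)                      # [(0, 'paid-C'), (0, 'paid-D')]: free-tier-A was preempted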
Real-World Numbers
vLLM (UC Berkeley, 2023): Reported 24× higher throughput than HuggingFace Transformers' naive serving for LLaMA and OPT models on identical hardware — achieved through continuous batching + PagedAttention.
LMSYS Chatbot Arena: Served 30,000–60,000 requests per day from a single A100 cluster using vLLM in 2023, with batching enabling a roughly 50% reduction in GPU costs compared to earlier serving approaches.
Speculative Decoding: A Different Lever
Batching improves throughput by filling GPU capacity. Speculative decoding improves decode speed differently: a small "draft" model proposes several tokens ahead, and the main model verifies all of them in one forward pass (which is fast because verification is parallel, like prefill).
If the draft is right, you get multiple tokens for the cost of one verification step — roughly 2–3× speedup on decode-heavy workloads. Speculative decoding and batching are complementary; production systems often use both.
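The verification logic is simple enough to sketch in a few lines. This is the greedy-acceptance version; real implementations accept and reject draft tokens probabilistically to preserve the main model's output distribution:

# Sketch of the acceptance step in speculative decoding (greedy version).
# `draft` is what the small draft model proposed; `verified` is the main model's
# prediction at each of those positions, all computed in one parallel pass.
def accept_draft(draft, verified):
    accepted = []
    for proposed, correct in zip(draft, verified):
        accepted.append(correct)     # the main model's token is always safe to keep
        if proposed != correct:
            break                    # first mismatch: throw away the rest of the draft
    return accepted                  # between 1 and k tokens per verification pass

print(accept_draft(["the", "cat", "sat"], ["the", "cat", "slept"]))
# ['the', 'cat', 'slept']: two draft tokens accepted, the mismatch replaced by the main model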
When you send a message to ChatGPT, Claude, or Gemini, your request is almost certainly batched together with hundreds of others on the same GPU cluster — right now, as you read this. The response streaming you see is not the model waiting for you; it's the model running one decode step across the entire batch and sending your token as it's produced.