LLM Batching — Static vs Continuous Batching
What is LLM Batching?
GPUs are massively parallel processors, but serving one user at a time wastes most of that capacity. Batching groups multiple requests together so the GPU processes them simultaneously, multiplying throughput. Static batching waits for all requests in a batch to finish before starting new ones — simple but wasteful when requests have different lengths. Continuous batching (used by vLLM, TGI) inserts new requests as soon as a slot opens, keeping the GPU busy. The tradeoff is memory: each request in the batch needs its own KV cache, so batch size is limited by available GPU memory.
Why GPUs Need Batching
GPUs are not like CPUs. A modern CPU has 8–32 cores optimized for sequential, low-latency work. A modern GPU has thousands of cores designed to do the same operation on many pieces of data at once — a style called SIMD (Single Instruction, Multiple Data).
That architectural choice pays off enormously for matrix multiplications — the core operation in a transformer. But it creates a catch: you need enough work to fill all those cores at once. One request, no matter how complex, won't do it.
The Factory Analogy
Imagine a factory floor with 10,000 workers. You hire them all to assemble a product. If you only give them one order at a time, 9,999 workers stand idle while one assembles. The factory is "running" but operating at 0.01% capacity.
Batching is giving the factory 10,000 orders at once — all workers stay busy, throughput skyrockets, and the cost per unit drops dramatically.
Memory-Bound vs Compute-Bound
There's a more technical reason batching matters. GPU work falls into two categories:
- Memory-bound: The GPU is waiting for data to arrive from memory. Cores sit idle.
- Compute-bound: The GPU is busy doing math. Cores are fully utilized.
With a single request, the weight matrices for each transformer layer need to be loaded from GPU memory for every token step. That's a lot of data movement — and with only one request using the result, you're paying the memory bandwidth cost for very little compute. The GPU is memory-bound.
Add more requests to the batch, and the same weight matrices get used to compute outputs for all of them simultaneously. Now you're doing much more compute per memory load. The GPU becomes compute-bound — which is where it's most efficient.
Kipply's "Transformer Inference Arithmetic" identifies roughly 208 tokens in flight as the threshold where inference on an A100 transitions from memory-bound to compute-bound (the number is simply the ratio of the GPU's FP16 compute throughput to its memory bandwidth). Below that, you're wasting capacity.
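As a back-of-envelope sketch of where that number comes from, assuming an A100's roughly 312 TFLOPS of FP16 compute and 1.5 TB/s of memory bandwidth (other GPUs have different ratios):

# Where the ~208-token threshold comes from (rough A100 figures, assumed)
compute = 312e12       # FP16 tensor-core FLOPs per second
bandwidth = 1.5e12     # HBM memory bandwidth in bytes per second

# Each FP16 weight is 2 bytes and costs 2 FLOPs (multiply + add) per token
# in flight, so bytes and FLOPs per parameter cancel out: the break-even
# batch is simply the hardware's FLOPs-to-bytes ratio.
tokens_to_saturate = compute / bandwidth
print(f"~{tokens_to_saturate:.0f} tokens in flight to become compute-bound")   # ~208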
Batching doesn't make your individual request faster — it makes the GPU serve more requests per second. Throughput goes up; latency per request stays roughly the same or increases slightly. The goal is cost efficiency, not speed.
The right panel shows the batching simulator. You'll use it in the next few steps to see exactly how different batching strategies affect GPU utilization.
Static Batching: The Blocking Problem
The simplest way to batch requests is static batching: collect N requests, run them all together as one batch, and wait until every single request in the batch finishes before accepting new ones.
How It Works
- Collect a fixed batch of N requests (e.g., N=3: R0, R1, R2)
- Run all N requests in parallel, one token step at a time
- Wait until the longest request finishes
- Only then: release all slots and start the next batch
This works, but it has an obvious flaw.
The Idle Slot Problem
Requests have wildly different lengths. A simple "What's 2+2?" might complete in 5 tokens. A detailed "Explain quantum computing" might need 300 tokens.
In a static batch:
- R0 finishes at token 20 — its slot sits idle
- R1 finishes at token 20 — its slot sits idle
- R2 needs 200 tokens — the batch can't end until it's done
For 180 token steps, two out of three GPU slots are doing nothing. The GPU is running at 33% utilization while paying 100% of the memory bandwidth cost.
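To put numbers on the waste, here's a tiny illustrative calculation using the lengths from this example:

# Wasted capacity in a static batch (numbers from the example above)
lengths = [20, 20, 200]                  # tokens generated by R0, R1, R2
duration = max(lengths)                  # the batch holds all slots for 200 steps
useful = sum(lengths)                    # 240 slot-steps of real work
total = len(lengths) * duration          # 600 slot-steps reserved

print(f"overall utilization: {useful / total:.0%}")                           # 40%
print(f"after step 20: {1 / len(lengths):.0%} of slots busy for 180 steps")   # 33%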
Try It
In the right panel, the Static Batching side shows the batch in progress. Watch what happens to R0 (short, 2 steps) after it finishes — its progress bar stalls while R1 (long, 6 steps) is still running. The GPU utilization bar drops as more requests complete.
The GPU utilization metric at the bottom shows the real cost: the more length variation in a batch, the more wasted capacity.
Static batching was how most LLM serving worked from 2020 to 2022 — including early versions of HuggingFace's text generation pipelines. It's simple to implement, but it leaves significant GPU capacity on the table every time there's a length mismatch in the batch.
The fix turns out to be conceptually simple: instead of waiting for the whole batch to finish, check after every token step whether any slot has freed up — and if so, immediately fill it with a new request.
Continuous Batching
Continuous batching — also called iteration-level scheduling — solves the idle slot problem by making a simple change: check after every single token step whether any request has finished, and if so, immediately admit a new one from the queue.
The Key Insight
Static batching treats the batch as the unit of work. Continuous batching treats the token step as the unit of work.
# Static batching (simplified pseudocode)
def run_batch(requests):
    while not all_done(requests):
        run_one_step(requests)                   # all requests, together
    return [r.result for r in requests]          # only return when ALL are done

# Continuous batching
def run_continuous(queue, max_batch_size=3):
    active = []
    while queue and len(active) < max_batch_size:
        active.append(queue.pop(0))              # fill the batch to start (FIFO)
    while active:
        run_one_step(active)                     # step all active requests
        for r in [r for r in active if r.is_done()]:
            active.remove(r)
            yield r.result                       # return its result immediately
            if queue:
                active.append(queue.pop(0))      # fill the freed slot right away
The loop runs once per token step. Done requests leave, new ones enter. The GPU is always at full capacity.
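If you want to execute the two loops above, here's a minimal toy harness; Request, all_done, and run_one_step are stand-ins invented for this sketch, not real serving code:

# Toy stand-ins so the loops above actually run; nothing here touches a real
# model, it just pretends to decode one token per step
class Request:
    def __init__(self, name, length):
        self.name, self.length, self.generated = name, length, 0
    def is_done(self):
        return self.generated >= self.length
    @property
    def result(self):
        return f"{self.name}: {self.generated} tokens"

def all_done(requests):
    return all(r.is_done() for r in requests)

def run_one_step(requests):
    for r in requests:
        if not r.is_done():
            r.generated += 1

queue = [Request("R0", 2), Request("R1", 6), Request("R2", 3), Request("R3", 4)]
for result in run_continuous(queue, max_batch_size=2):
    print(result)    # prints R0, R2, R1, R3: short requests finish first and free their slots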
The Numbers
The Orca paper (Yu et al., 2022) introduced iteration-level scheduling. Combined with vLLM's PagedAttention (which we'll cover in the next module), this approach delivers roughly 23× higher throughput than naive static batching — without any changes to the model itself.
vLLM's 2023 blog reported 24× higher throughput than HuggingFace transformers on LLaMA and OPT models, with the same GPU hardware.
Try It
In the right panel, compare the Static and Continuous utilization bars side by side. Step through time using the controls:
- In Static: watch R0 finish early and its slot go idle
- In Continuous: when R0 finishes, R3 immediately fills the slot — utilization stays high throughout
The throughput summary at the bottom shows real-time utilization for both approaches at each time step.
Continuous batching is now the universal default. Every major LLM serving framework uses it: vLLM, HuggingFace TGI, SGLang, TensorRT-LLM. If you're deploying a model today, you're almost certainly using continuous batching whether you know it or not.
But continuous batching at the token level introduces a new scheduling challenge: some requests are in their prefill phase (processing the prompt) while others are in their decode phase (generating tokens). These two phases have very different compute characteristics — and they compete for the same GPU resources.
Prefill vs Decode Scheduling
A Quick Recap
From the KV Cache module: every request goes through two distinct phases.
Prefill — the model processes your entire prompt in one parallel pass. All tokens are handled simultaneously. This is compute-intensive: the GPU does a lot of math at once, similar to training.
Decode — the model generates one new token at a time, sequentially. Each step attends to all previous tokens via the KV cache. This is memory-intensive: the GPU is mainly reading cached key/value tensors rather than doing heavy computation.
These two phases have fundamentally different performance characteristics. Prefill is compute-bound — it benefits from a large batch of prompt tokens processed together. Decode is memory-bound — it benefits from many requests being decoded in parallel so the memory bandwidth cost is shared.
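One way to see the difference is arithmetic intensity: FLOPs of useful work per byte of weights read. The numbers below are rough assumptions for a 7B model in FP16, meant only to show the gap between the two phases:

# Rough arithmetic intensity of prefill vs decode for an assumed 7B model
params = 7e9
weight_bytes = params * 2          # FP16: every weight is read once per forward pass
flops_per_token = 2 * params       # one multiply + one add per parameter per token

prompt_len = 4_000                 # prefill processes the whole prompt in one pass
prefill = prompt_len * flops_per_token / weight_bytes    # 4000 FLOPs per byte
decode = 1 * flops_per_token / weight_bytes              # 1 FLOP per byte

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# Decode pays the same memory traffic for 1/4000th of the math, which is why it
# needs many concurrent requests in the batch to keep the GPU busy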
The Problem: Prefill Blocks Decode
When a new request arrives with a long prompt — say, a 4,000-token PDF — the serving system must run prefill for those 4,000 tokens before it can generate the first response token.
During that prefill, the GPU is fully occupied. Every other request in the batch — including ones already in their decode phase, where users are waiting for the next token — must pause and wait.
For a user whose request is mid-decode, this shows up as a noticeable stall: their stream of tokens stops, then resumes. The longer the new arrival's prompt, the longer everyone else waits.
Chunked Prefill: Interleaving the Work
Chunked prefill breaks a long prompt into fixed-size chunks — typically 512 tokens — and interleaves them with decode steps for other requests.
Instead of: "Run all 4,000 prefill tokens, then resume decoding for everyone"
You get: "Run 512 prefill tokens, run one decode step for everyone, run next 512 prefill tokens, decode again, ..."
This keeps decode latency smooth even when long-context requests arrive. Each decode step is delayed by one chunk's worth of prefill work — a small, bounded cost — rather than the entire prompt length.
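In scheduling terms, the interleaving looks roughly like this. The snippet only simulates the ordering of work; real frameworks typically fuse the prefill chunk and the decode tokens into a single forward pass:

# Simulated schedule: a 4,000-token prompt arrives while other requests are decoding
CHUNK = 512
prompt_remaining = 4_000
schedule = []

while prompt_remaining > 0:
    schedule.append(f"prefill {min(CHUNK, prompt_remaining)} prompt tokens")
    schedule.append("decode 1 token for every in-flight request")
    prompt_remaining -= CHUNK

print(schedule[:4])
# ['prefill 512 prompt tokens', 'decode 1 token for every in-flight request',
#  'prefill 512 prompt tokens', 'decode 1 token for every in-flight request']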
Without chunked prefill, uploading a long PDF or pasting a large codeblock into a chat interface would cause visible stalls for every other user on the same server at that moment. Chunked prefill is why production systems can handle mixed workloads — short chat turns and long document analysis — without one type degrading the other.
Chunked prefill is supported in vLLM (introduced as an opt-in option around v0.4.0 and enabled by default in later releases) and SGLang. It's a concrete example of how scheduler design — not just model architecture — determines real-world serving quality.
Memory Limits the Batch
We've seen that continuous batching keeps GPU utilization high by filling slots immediately. But there's a hard physical constraint on how large a batch can actually be: GPU memory.
The KV Cache Memory Cost
Recall from the KV Cache module: every active request stores a KV cache — the key and value tensors for every attention head at every layer, for every token processed so far. This cache grows with context length.
A rough estimate for a 7B-parameter model (32 layers, 32 heads, 128 head dimension, FP16):
- Per token: 2 tensors (K and V) × 32 layers × 32 heads × 128 dims × 2 bytes = 524,288 bytes ≈ 512 KB per token
- 512-token context: ~256 MB per request
- 4,096-token context: ~2 GB per request
- 32,768-token context: ~16 GB per request
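The same estimate in a few lines of Python (the layer/head/dimension numbers are the assumed 7B configuration above):

# KV cache footprint for the assumed 7B configuration (32 layers, 32 heads, d=128, FP16)
layers, heads, head_dim, bytes_per_value = 32, 32, 128, 2
kv_tensors = 2                                    # one K and one V tensor per layer

per_token = kv_tensors * layers * heads * head_dim * bytes_per_value   # 524,288 B ≈ 512 KB

for context in (512, 4_096, 32_768):
    print(f"{context:>6} tokens -> {context * per_token / 2**30:.2f} GiB per request")
# 512 -> 0.25 GiB, 4096 -> 2.00 GiB, 32768 -> 16.00 GiB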
How Context Length Shrinks Your Batch
An A100 80GB GPU has roughly 40 GB available for KV cache after loading model weights (~14 GB for a 7B model in FP16, plus overhead).
At 512 KB/token:
| Context length | KV cache per request | Max batch size (40 GB) |
|---|---|---|
| 512 tokens | ~256 MB | ~152 requests |
| 4,096 tokens | ~2 GB | ~19 requests |
| 32,768 tokens | ~16 GB | ~2 requests |
| 128,000 tokens | ~64 GB | 0 (doesn't fit) |
The diagram above illustrates the tradeoff: the same GPU memory budget can accommodate many short-context requests or very few long-context ones.
Why This Matters for Throughput
From Kipply's analysis, you need roughly 208 tokens in flight simultaneously to keep the GPU compute-bound while serving a 7B model. At short context lengths, you have plenty of batch slots to achieve this. At 32K context, you're down to 2 concurrent requests — which may be below the compute-bound threshold, meaning the GPU is memory-bound again for a different reason: not enough parallelism.
This is why long-context models are expensive to serve. It's not that the model itself is slower — it's that the KV cache for a 128K-token context fills most of the GPU's memory, leaving room for at most one or two concurrent users.
When you hear that GPT-4's 128K context window is expensive, this is a big part of why. Serving a single 128K-context request ties up an entire GPU's memory budget. To serve it efficiently, providers use techniques like KV cache quantization (from the Quantization module) and paged memory management — which is exactly what the next module, Paged Attention, covers.
This memory constraint is also why the next-generation serving technique — PagedAttention — matters so much: it lets you manage KV cache memory more efficiently, fitting more concurrent requests into the same physical memory.
Batching in Production
Understanding batching theory is one thing. Seeing how it plays out at scale clarifies why these engineering choices matter so much.
Throughput vs Latency: The Core Tradeoff
Larger batches = higher throughput, but potentially higher latency per request. Every request waits slightly longer to be scheduled because the system is trying to accumulate a full batch.
Throughput (requests per second) and latency (time to first token, time to complete) pull in opposite directions:
- Aggressive batching: maximize GPU utilization, higher average latency
- Conservative batching: lower latency, underutilized GPU, higher cost per request
Production systems tune this based on SLAs. A real-time chat interface prioritizes time-to-first-token (TTFT). A batch summarization pipeline prioritizes throughput and cost.
Priority and Preemption
Not all requests are equal. A paid subscriber's request might need to preempt a free-tier request. A system under load might need to prioritize short requests to keep median latency low.
Modern serving frameworks support:
- Request priority levels — high-priority requests jump the queue
- Preemption — a running request can be paused and its KV cache evicted to free memory for a higher-priority request (then resumed later via recomputation or swapping to CPU memory)
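Here's a toy sketch of how priority and preemption might fit together in a scheduler. The queue layout and policy are invented for illustration and are not any framework's actual implementation:

# Toy priority scheduler with preemption (invented for illustration).
# Lower number = higher priority; each running request holds a batch slot + KV cache.
import heapq

MAX_SLOTS = 2
waiting = [(1, "free-tier-A"), (1, "free-tier-B"), (0, "paid-C")]
heapq.heapify(waiting)
running = []

def schedule_step():
    # admit the highest-priority waiting requests into any free slots
    while waiting and len(running) < MAX_SLOTS:
        running.append(heapq.heappop(waiting))
    # preempt: if a waiting request outranks the worst running one, evict that
    # request (freeing its KV cache) and requeue it to resume later
    while waiting and running and waiting[0][0] < max(p for p, _ in running):
        victim = max(running, key=lambda r: r[0])
        running.remove(victim)
        heapq.heappush(waiting, victim)
        running.append(heapq.heappop(waiting))

schedule_step()
print(running)                      # [(0, 'paid-C'), (1, 'free-tier-A')]

heapq.heappush(waiting, (0, "paid-D"))   # a paid request arrives while both slots are full
schedule_step()
print(running)                      # [(0, 'paid-C'), (0, 'paid-D')]: free-tier-A was preempted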
Real-World Numbers
vLLM (UC Berkeley, 2023): Reported 24× higher throughput than HuggingFace Transformers' naive serving for LLaMA and OPT models on identical hardware — achieved through continuous batching + PagedAttention.
LMSYS Chatbot Arena: Served 30,000–60,000 requests per day from a single A100 cluster using vLLM in 2023, with batching enabling a roughly 50% reduction in GPU costs compared to earlier serving approaches.
Speculative Decoding: A Different Lever
Batching improves throughput by filling GPU capacity. Speculative decoding improves decode speed differently: a small "draft" model proposes several tokens ahead, and the main model verifies all of them in one forward pass (which is fast because verification is parallel, like prefill).
If the draft is right, you get multiple tokens for the cost of one verification step — roughly 2–3× speedup on decode-heavy workloads. Speculative decoding and batching are complementary; production systems often use both.
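The verification logic is simple enough to sketch in a few lines. This is the greedy-acceptance version; real implementations accept and reject draft tokens probabilistically to preserve the main model's output distribution:

# Sketch of the acceptance step in speculative decoding (greedy version).
# `draft` is what the small draft model proposed; `verified` is the main model's
# prediction at each of those positions, all computed in one parallel pass.
def accept_draft(draft, verified):
    accepted = []
    for proposed, correct in zip(draft, verified):
        accepted.append(correct)     # the main model's token is always safe to keep
        if proposed != correct:
            break                    # first mismatch: throw away the rest of the draft
    return accepted                  # between 1 and k tokens per verification pass

print(accept_draft(["the", "cat", "sat"], ["the", "cat", "slept"]))
# ['the', 'cat', 'slept']: two draft tokens accepted, the mismatch replaced by the main model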
When you send a message to ChatGPT, Claude, or Gemini, your request is almost certainly batched together with hundreds of others on the same GPU cluster — right now, as you read this. The response streaming you see is not the model waiting for you; it's the model running one decode step across the entire batch and sending your token as it's produced.