
LLM Inference Engine Internals Explained Visually

The Inference Engine

What is an Inference Engine?

An inference engine is the system that sits between your API request and the GPU. When you send a prompt to ChatGPT, Claude, or any LLM API, an inference engine handles everything: queuing your request alongside hundreds of others, allocating GPU memory for your KV cache, running the actual model computation, and streaming tokens back to you.

The most widely used open-source inference engine is vLLM, created at UC Berkeley. This module follows vLLM's architecture to explain how inference engines work.

Why Not Just Run the Model Directly?

The naive approach — process one request at a time — wastes the GPU. A single decode step takes ~10ms, but the GPU spends most of that time waiting for memory reads, not computing. Meanwhile, dozens of other requests sit idle in a queue.

An inference engine solves this by batching hundreds of requests together and processing them all in a single GPU forward pass. The GPU reads model weights from memory once and applies them to every request simultaneously.

Naive: One at a Time

  Step 1: only R1 runs; the other requests sit idle
  Step 2: only R2 runs
  Step 3: only R3 runs
  GPU reads the weights 3× for 3 requests

Engine: All at Once

  Step 1: R1, R2, R3, R4, R5 all run together
  GPU reads the weights 1× for all 5 requests

Idle slots are wasted GPU capacity. Batching fills the GPU on every step.

In the Batching module, you learned that processing multiple requests together is more efficient than one at a time. An inference engine automates this — and goes much further, dynamically managing which requests run, how memory is allocated, and when requests join or leave the batch.

The Engine Loop

The engine runs a tight loop, completing one iteration every ~10-50ms, tens of times per second:

Engine Iteration Loop

  Schedule (pick which requests run) → Allocate (assign KV cache blocks) → Execute (run GPU forward pass) → Sample (generate next tokens) ↻ repeat every ~10-50ms

Each iteration:

  1. Schedule — the scheduler picks which requests get GPU time this step
  2. Allocate — the memory manager assigns KV cache blocks
  3. Execute — the model executor runs one GPU forward pass on all active requests
  4. Sample — next tokens are generated and streamed back

This loop is the heartbeat of every LLM serving system. The rest of this module zooms into each stage.
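In code, one turn of that loop might look like the sketch below. The component and method names (scheduler, kv_cache_manager, model_executor, sampler) are illustrative stand-ins, not vLLM's actual API.

def engine_loop(scheduler, kv_cache_manager, model_executor, sampler,
                waiting_queue, running_set, token_budget):
    """One engine loop, repeated forever (hypothetical component names)."""
    while True:
        # 1. Schedule: pick which requests get GPU time this iteration.
        batch = scheduler.schedule(waiting_queue, running_set, token_budget)
        # 2. Allocate: assign KV cache blocks for new and growing requests.
        kv_cache_manager.allocate(batch)
        # 3. Execute: one GPU forward pass over all active requests at once.
        logits = model_executor.forward(batch)
        # 4. Sample: generate next tokens, stream them back, free finished requests.
        for request, next_token in sampler.sample(logits):
            request.stream(next_token)
            if request.is_finished():
                kv_cache_manager.free(request)
                running_set.remove(request)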

On the right panel: Click each stage of the pipeline to preview what it does. Notice how every iteration processes ALL active requests simultaneously — not one at a time.

The Scheduler

The scheduler is the traffic controller of the inference engine. Every iteration, it decides which requests get GPU time — and which ones wait.

Three terms before we go further:

  • Iteration — one GPU forward pass that processes all currently-active requests at once. Not "one decoded token per request" — one batched pass over the whole running set.
  • Token budget — the maximum tokens (prefill + decode combined) the engine will process in a single iteration. Set globally at engine startup, not per request.
  • GPU memory budget — the number of KV-cache blocks the memory manager is allowed to hold across all active requests. Also global.

Every admission, preemption, and gauge below is measured in these units.

Two Queues

The scheduler maintains two lists:

  • Waiting queue — new requests that haven't started yet. They're queued in order of arrival (FCFS — first come, first served).
  • Running set — requests that are actively generating tokens on the GPU. These get priority: the scheduler never starves a running request to admit a new one.

Each iteration, the scheduler tries to move requests from waiting → running, subject to two constraints.

Two Constraints: Token Budget and GPU Memory

The scheduler admits one new request per iteration, but only if it passes two checks:

1. Token budget — the maximum tokens the engine will process in one iteration (e.g., 4096 in production). A prefilling request consumes its entire prompt length from the budget (a 512-token prompt uses 512). A decoding request consumes just 1 token. So once requests are running, the engine can sustain dozens of them — each only costs 1 token per step. But admitting a new request with a long prompt temporarily eats a big chunk of the budget.

2. GPU memory — every active request needs KV cache blocks in GPU memory. A request with a 512-token prompt needs 32 blocks (at 16 tokens per block). If there aren't enough free blocks, the new request can't start — even if there's token budget left.

In practice, a GPU like the A100 (80 GB) might hold ~60,000 blocks. With typical request sizes, this supports hundreds of concurrent requests. The token budget is usually the tighter constraint.
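As a rough sketch of those two checks (simplified policy and invented names, not vLLM's real scheduler code):

BLOCK_SIZE = 16          # tokens per KV-cache block
TOKEN_BUDGET = 4096      # max tokens processed per iteration (global)

def try_admit(request, tokens_used_this_iter, free_blocks):
    """Admit a waiting request only if both constraints hold (simplified sketch)."""
    # Constraint 1: token budget. A new request must prefill its whole prompt.
    prompt_tokens = len(request.prompt_token_ids)
    if tokens_used_this_iter + prompt_tokens > TOKEN_BUDGET:
        return False
    # Constraint 2: GPU memory. Every prompt token needs a slot in a KV-cache block.
    blocks_needed = -(-prompt_tokens // BLOCK_SIZE)   # ceil(prompt_tokens / 16)
    if blocks_needed > free_blocks:
        return False
    return True

For the running example: a 512-token prompt consumes 512 tokens of the 4096-token budget and needs 32 free blocks.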

How Many Requests Can Run?

It depends on the phase:

  • During decode, each request costs only 1 token of budget. With a budget of 4096, you could have ~4000 requests decoding simultaneously (if memory allows).
  • During prefill, one long prompt (say 2048 tokens) eats half the budget alone, blocking other admissions for that iteration.

In vLLM V1, there's no separate "prefill phase" or "decode phase" — the scheduler treats all tokens uniformly, represented as {request_id: num_tokens}. This means a single iteration can mix prefills and decodes.

Preemption: When Memory Runs Out

GPU memory is finite. When the KV cache blocks are exhausted and a higher-priority request needs space, the scheduler preempts a running request:

  1. The lowest-priority running request is selected
  2. Its KV cache blocks are evicted (freed from GPU memory)
  3. The request re-enters the waiting queue
  4. When it's readmitted later, it must be re-prefilled (recomputed)

This is the serving equivalent of OS process swapping — temporarily evicting work to make room for other work. Preemption trades compute (re-prefilling) for fairness — no single long request can monopolize the GPU forever.
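A minimal sketch of that preemption path, with hypothetical names:

def preempt_for_memory(running_set, waiting_queue, kv_cache_manager):
    """Evict one running request to free KV-cache blocks (simplified sketch)."""
    # 1. Pick the lowest-priority running request (smallest priority value here).
    victim = min(running_set, key=lambda r: r.priority)
    # 2. Evict its KV cache blocks, freeing GPU memory.
    kv_cache_manager.free(victim)
    running_set.remove(victim)
    # 3. It re-enters the waiting queue (a deque here)...
    waiting_queue.appendleft(victim)
    # 4. ...and must be re-prefilled from scratch when readmitted.
    victim.num_computed_tokens = 0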

On the right panel: The simulator starts with 6 requests in the waiting queue. Click Play to watch the scheduler admit them one per iteration. Notice the two gauges: token budget (pink) and GPU memory (blue). Keep clicking "+ Request" until the memory gauge turns red — that's when preemption kicks in, evicting a running request (marked "preempted") back to waiting.

The Memory Manager

In the PagedAttention module, you learned that KV cache is split into fixed-size blocks that don't need to be contiguous in memory — like OS virtual memory pages. Now see how the engine manages those blocks at runtime, allocating and freeing them across hundreds of concurrent requests.

Block Tables

Every active request has a block table — a mapping from logical blocks to physical blocks.

Block Table: Logical → Physical

  L0 → P7  (16 tokens)
  L1 → P2  (16 tokens)
  L2 → P14 (16 tokens)
  L3 → P9  (16 tokens)

Contiguous logical blocks → scattered physical blocks in GPU memory

  • Logical blocks are what the request sees: contiguous positions (L0, L1, L2...) for its token sequence
  • Physical blocks are where the data actually lives in GPU memory — scattered across whatever blocks happen to be free

The request doesn't know or care that L0 maps to physical block 7 and L1 maps to physical block 2. The block table handles the translation, just like a page table in OS virtual memory.
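A tiny sketch of that translation, using the mapping from the figure above (illustrative code, not vLLM's data structures):

BLOCK_SIZE = 16
block_table = [7, 2, 14, 9]    # L0->P7, L1->P2, L2->P14, L3->P9

def physical_location(token_position):
    """Translate a token's logical position into (physical block, offset), page-table style."""
    logical_block = token_position // BLOCK_SIZE
    offset = token_position % BLOCK_SIZE
    return block_table[logical_block], offset

print(physical_location(37))   # (14, 5): token 37 lives at offset 5 of physical block P14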

Block Lifecycle

Blocks go through three phases:

Allocation on prefill — when a new request starts, the memory manager grabs enough blocks for all prompt tokens. A 512-token prompt with block size 16 needs 32 blocks (512 ÷ 16 = 32). These blocks are allocated from a free pool.

Growth during decode — as the model generates new tokens, they fill the current block. When a block is full (16 tokens written), the memory manager allocates one more block from the free pool. This happens automatically each iteration.

Deallocation on finish — when a request completes (hits the end-of-sequence token or max length), all its blocks return to the free pool immediately. This frees memory for new requests.
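Putting the three phases together, a toy memory manager might look like this (assumed class and method names; vLLM's real implementation adds prefix caching, reference counting, and more):

from collections import deque

BLOCK_SIZE = 16

class KVCacheManager:
    """Toy sketch: a free pool plus per-request block tables."""

    def __init__(self, num_blocks):
        self.free_pool = deque(range(num_blocks))   # O(1) pop/append at both ends
        self.block_tables = {}                      # request_id -> [physical block ids]

    def allocate_prefill(self, request_id, prompt_len):
        # Allocation on prefill: e.g. 512 tokens / 16 per block = 32 blocks.
        blocks_needed = -(-prompt_len // BLOCK_SIZE)
        self.block_tables[request_id] = [self.free_pool.popleft() for _ in range(blocks_needed)]

    def maybe_grow(self, request_id, total_tokens):
        # Growth during decode: grab one more block once the current blocks are full.
        capacity = len(self.block_tables[request_id]) * BLOCK_SIZE
        if total_tokens > capacity:
            self.block_tables[request_id].append(self.free_pool.popleft())

    def free(self, request_id):
        # Deallocation on finish: all blocks return to the pool immediately.
        self.free_pool.extend(self.block_tables.pop(request_id))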

Why This Matters

The free block pool uses a doubly-linked list for O(1) allocation and deallocation. Memory waste only occurs in the last block of each request (partially filled), keeping utilization above 96%.

This is what makes continuous batching possible: because blocks are allocated and freed dynamically per-request, the engine can smoothly handle requests arriving and completing at different times without fragmenting GPU memory.

On the right panel: Watch the memory grid as requests are processed. Each color is a different request's KV cache blocks. Click any colored block to see its block table — the logical → physical mapping. Notice how blocks are scattered (non-contiguous) but the request sees a clean sequence.

The Model Executor

The scheduler decided who runs. The memory manager allocated their KV cache blocks. Now the model executor does the actual GPU computation.

One Forward Pass for Everyone

Every iteration, the executor takes all active requests and flattens their tokens into a single long sequence. Position indices and attention masks ensure each request only attends to its own tokens — they don't interfere with each other.

Active Requests

  R1: t0 t1 t2 t3      R2: t0 t1 t2      R3: t0 t1 t2 t3 t4

  ↓ flatten into one sequence

GPU Batch Tensor

  tokens:    t0 t1 t2 t3 | t0 t1 t2 | t0 t1 t2 t3 t4
  positions:  0  1  2  3 |  0  1  2 |  0  1  2  3  4   (position indices reset per request)

Attention Mask

  each request only attends to its own tokens
Why flatten? Because the most expensive part of each iteration is loading model weights from GPU memory — a 7B-parameter model means reading ~14 GB of weights every single step. If you process one request at a time, you pay that 14 GB read cost per request. But if you flatten 50 requests into one batch, you read the weights once and apply them to all 50 requests in the same pass. The compute cost per request barely increases, but you've spread that massive memory read across 50x more useful work.
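Here is a small sketch of that flattening step for the three requests in the figure, with positions that reset per request (the token IDs are made up; a real executor builds GPU tensors plus block-table metadata):

# Flatten per-request token lists into one batch with per-request positions.
requests = {"R1": [101, 102, 103, 104],         # 4 tokens
            "R2": [201, 202, 203],              # 3 tokens
            "R3": [301, 302, 303, 304, 305]}    # 5 tokens

flat_tokens, positions, request_of = [], [], []
for rid, tokens in requests.items():
    flat_tokens.extend(tokens)
    positions.extend(range(len(tokens)))        # 0..n-1 restarts for every request
    request_of.extend([rid] * len(tokens))

# Attention rule: token i may attend to token j only if both belong to the same
# request and j comes no later than i (causal masking within each request).
def can_attend(i, j):
    return request_of[i] == request_of[j] and j <= i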

Prefill vs Decode: Two Very Different Workloads

Not all iterations are equal. The engine interleaves two types of work:

Prefill processes all prompt tokens at once. A 512-token prompt means 512 tokens flow through every transformer layer in parallel. This is compute-bound — the GPU's compute units are saturated. High arithmetic intensity (many FLOPs per byte of memory read).

Decode generates one new token per request. The model reads all cached KV values but only computes attention for one new token. This is memory-bandwidth-bound — the GPU spends most of its time reading KV cache from HBM. Low arithmetic intensity.

Memory-bandwidth-bound (GPU & CUDA → roofline-model)
A workload limited by how fast the GPU can read data from HBM, not by how fast it can compute. Decode is the canonical example — the model spends most of its time loading weights and KV cache.

If you completed the GPU & CUDA track's Roofline module, you'll recognize this:

Roofline (GPU & CUDA → roofline-model)
A plot of compute throughput vs arithmetic intensity. Operations either hit the compute roof (compute-bound) or sit on the memory-bandwidth slope (memory-bound).

Prefill vs Decode on the Roofline

  Decode ≈ 1 FLOP/byte (memory-bound)  ·  Prefill ≈ 100 FLOP/byte (compute-bound)

Same GPU, fundamentally different bottlenecks

Prefill sits near the compute roof — the GPU's math units are the bottleneck. Decode sits on the memory bandwidth slope — the GPU spends most time reading KV cache, not computing. Same GPU, same model, but fundamentally different bottlenecks depending on whether you're processing a prompt or generating tokens.
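A rough back-of-envelope makes the gap concrete. This sketch assumes a 7B-parameter fp16 model and counts only the weight read, ignoring KV-cache and activation traffic:

# Back-of-envelope arithmetic intensity (FLOP per byte read) -- rough estimates, not measurements.
params = 7e9
weight_bytes = params * 2            # fp16: ~14 GB of weights read per forward pass
flops_per_token = 2 * params         # ~14 GFLOPs of matrix-multiply work per token

decode_ai = (1 * flops_per_token) / weight_bytes      # 1 new token -> ~1 FLOP/byte (memory-bound)
prefill_ai = (512 * flops_per_token) / weight_bytes   # 512-token prompt -> hundreds of FLOP/byte (compute-bound)

print(decode_ai, prefill_ai)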

Sampling

After the forward pass, the executor has logits (raw scores) for each request's next token. These go through sampling — temperature scaling, top-p filtering — to produce the actual next token. Finished requests (hit end-of-sequence or max length) are removed from the running set, freeing their memory blocks.
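A minimal sketch of that sampling step for one request (simplified; production samplers batch this across all requests and add top-k, repetition penalties, and more):

import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Temperature scaling + nucleus (top-p) sampling over a 1-D array of logits."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))     # softmax with max-subtraction for stability
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]

    keep_probs = probs[keep] / probs[keep].sum()
    return int(np.random.choice(keep, p=keep_probs))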

On the right panel: Toggle between "Prefill" and "Decode" to see how they differ. Watch the compute vs bandwidth gauges — prefill maxes out compute, decode maxes out bandwidth. The flattened batch tensor shows how all requests are concatenated for one forward pass.

Continuous Batching in Action

Now let's put all the pieces together and see why continuous batching is such a breakthrough.

The Problem with Static Batching

Traditional static batching groups requests together and processes them as a fixed batch. The problem: requests finish at different times. A short response (20 tokens) completes long before a long response (200 tokens), but the GPU slot sits idle until the entire batch finishes.

Static Batching

  Slot 1: A A A A A
  Slot 2: B B B · ·
  Slot 3: C C · · ·
  Slot 4: D D D D ·

Continuous Batching

  Slot 1: A A A A A
  Slot 2: B B B E E
  Slot 3: C C F F F
  Slot 4: D D D D G

Columns are iterations t1-t5. · = idle GPU slot. Continuous batching fills gaps immediately.

Those gray idle slots are wasted GPU cycles. The longer the variance in response lengths, the more waste.
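To put numbers on the toy diagram above: static batching fills 14 of the 20 slot-iterations (70% utilization), while continuous batching backfills E, F, and G into the freed slots and fills all 20.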

Continuous Batching: Fill the Gaps

Continuous batching — formally called iteration-level scheduling, introduced by the Orca paper (OSDI '22) — makes a simple but powerful change: at every iteration, the scheduler can add new requests to freed slots immediately.

The full engine iteration loop:

  1. Scheduler picks the batch — running requests first, then admits waiting requests within the token budget
  2. Memory manager allocates blocks for new requests, grows blocks for ongoing ones
  3. Executor runs one forward pass on all active tokens (flattened into a single batch)
  4. Tokens sampled for each request — finished requests freed
  5. Scheduler re-evaluates — new requests fill the gaps left by completed ones
  6. Repeat

No idle slots. No wasted GPU cycles. New requests start generating within one iteration (~10-50ms) of a slot opening up.

The Numbers

The throughput improvement is dramatic:

  • Static batching → baseline
  • Continuous batching → 8x throughput improvement
  • Continuous batching + PagedAttention (vLLM) → 23x throughput improvement

The combination of iteration-level scheduling (no idle slots) and PagedAttention (no memory fragmentation) is what makes modern LLM serving practical.

On the right panel: Watch the side-by-side comparison. Static batching (left) shows idle slots as requests finish. Continuous batching (right) immediately fills gaps with new requests. The throughput ratio at the bottom shows how much more work continuous batching gets done. Click Play to start the simulation.
