
CUDA Graphs for LLM Decode

Kernel Launch Overhead

The two function calls this module is about

A kernel is a piece of code that runs on the GPU — a matmul, a RMSNorm, a softmax. To run one, the CPU has to tell the GPU "go do this." That instruction is a function call with a specific name:

  • cudaLaunchKernel — "Hey GPU, run this one kernel." The CPU makes this call every single time it wants a kernel executed. If a decode step has 800 kernels, the CPU makes 800 separate calls. Each call takes about 5µs of CPU work before the GPU actually starts computing.
  • cudaGraphLaunch — "Hey GPU, run this entire pre-recorded sequence." The CPU makes this call once and the GPU executes all 800 kernels back-to-back. One call, not 800.

That's the whole trick CUDA Graphs play: replace many cudaLaunchKernel calls with one cudaGraphLaunch. The rest of this module is about when that trick is worth it, how the recording works, and what it costs in practice.
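To make the eager side concrete, here is a minimal sketch (the kernel and names are illustrative, not vLLM's): a trivial kernel launched in a loop, where every <<<...>>> launch compiles down to a cudaLaunchKernel call and pays the fixed dispatch cost no matter how little work the kernel does.

// A deliberately tiny kernel: almost no compute, so launch overhead dominates.
__global__ void tiny_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

// Eager mode: one cudaLaunchKernel per kernel, num_kernels times per decode step.
// d_x must point to at least 32 device floats.
void eager_decode_step(float* d_x, cudaStream_t stream, int num_kernels) {
    for (int i = 0; i < num_kernels; i++) {
        tiny_kernel<<<1, 32, 0, stream>>>(d_x);   // ~5µs of CPU work per launch
    }
    cudaStreamSynchronize(stream);
}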

What is a CUDA Graph?

A CUDA Graph is a recording of a kernel sequence. You run the sequence once in "record mode" so the driver captures the order, the kernel configurations, and the memory addresses into a graph data structure. After that, calling cudaGraphLaunch(graph) replays the whole recording as a single driver command. Think of it like a macro for GPU work.

The small-kernel problem

Every cudaLaunchKernel call costs ~5µs of CPU work on an A100 — argument marshaling, driver submission, GPU front-end handshake. That cost — Kernel launch overhead — is constant regardless of how much compute the kernel does.

Kernel launch overhead (GPU & CUDA track → operator-fusion): the fixed CPU-side cost (~5–20 µs) of dispatching a single GPU kernel. When kernels run for less time than the launch cost, the GPU sits idle waiting for the next launch.

For a big matmul (100µs of compute), 5µs of overhead is 5% — negligible. For a tiny kernel (2µs of compute), 5µs of overhead is 71% — the CPU spends most of its time making phone calls instead of getting work done.
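The ratio is just launch cost over total time. Here it is as a tiny helper, using the ~5µs A100 figure from above:

double overhead_fraction(double launch_us, double compute_us) {
    return launch_us / (launch_us + compute_us);   // share of wall time spent launching
}
// overhead_fraction(5.0, 100.0) ≈ 0.048 → ~5%   (big matmul)
// overhead_fraction(5.0, 2.0)   ≈ 0.714 → ~71%  (tiny decode kernel)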

Each cudaLaunchKernel call is like a phone call from CPU to GPU — 5µs of overhead. For a 100µs task, that's fine. For a 2µs task, you're mostly on hold.

Try the kernel-size toggle on the right. Flip between "tiny" (2µs compute) and "big" (100µs compute) and watch the overhead-to-compute ratio. The 5µs launch cost doesn't change — only the compute block does. This is the key setup for the next step: decode kernels are tiny.

Hardware variance: 5µs is typical for A100 with warm drivers. H100 gets to sub-2µs in many paths; V100 was closer to 8µs. The ratio matters more than the absolute number.

Next: we'll see why a single Llama decode step fires 300–1,300 kernels, and why every one of them is in the "tiny" regime.

Why Decode Is the Worst Case

Prefill vs decode, revisited

Recall from the KV Cache and Prefill/Decode modules: prefill and decode have very different kernel shapes.

Prefill (prompt "The cat sat on", processed at once):
  • All prompt tokens processed at once (parallel)
  • KV cache fills up in one shot
  • GPU does lots of math (compute-bound)
  • Fast — GPU is good at parallel work

Decode (generates "the" → "mat" → ".", one token at a time):
  • Output tokens generated one at a time
  • Each step reads entire KV cache
  • GPU mostly loads data (memory-bound)
  • Slower — waiting for data, not computing

Prefill = one big batch (fast) → Decode = one token at a time (slower)

Prefill runs matmuls over the entire prompt at once — a few large kernels that each do tens to hundreds of µs of compute. Launch overhead is noise.

Decode runs matmuls over a single token per step — the same kernels, but with tiny inputs. Each kernel drops to 2–20µs of compute. Now the 5µs launch cost is a significant fraction.

Anatomy of one decode step

Before the kernel list, a quick aside on operator fusion. Naively, a transformer block has lots of tiny operations — residual-add, RMSNorm, SiLU, multiply, matmul. If each ran as its own CUDA kernel you'd get 20+ kernels per layer. Fusion means merging several ops into a single kernel — for example, fused_add_rms_norm does residual-add and RMSNorm in one launch using the same GPU registers. Fewer kernels = fewer launches = less HBM traffic.

Unfused (3 kernels):
HBM (read) → matmul → HBM (write+read) → bias add → HBM (write+read) → ReLU → HBM (write)
6 HBM accesses

vs

Fused (1 kernel):
HBM (read) → matmul + bias + ReLU → HBM (write)
2 HBM accesses

3× fewer HBM accesses — same computation

The diagram shows the pattern: matmul → bias add → ReLU run as 3 separate kernels, writing intermediate results to HBM between every step. Fused into one kernel, the intermediates stay in registers and only the final result hits HBM. (The GPU/CUDA track has a full module on this.)
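A minimal CUDA sketch of the idea (illustrative kernels, not vLLM's; the matmul is left out and the bias is treated as elementwise to keep it short): the unfused path launches bias-add and ReLU separately, with an HBM round trip between them, while the fused kernel applies both in registers before a single write.

// Unfused path: two launches, the intermediate result goes out to HBM and back.
__global__ void bias_add(float* y, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += bias[i];
}
__global__ void relu(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(y[i], 0.0f);
}

// Fused: one launch, bias + ReLU happen in registers, one read and one write.
__global__ void bias_add_relu(float* y, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(y[i] + bias[i], 0.0f);
}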

Post-fusion means "after vLLM has applied its fusion optimizations." Post-fusion, vLLM runs exactly 10 CUDA kernels per Llama transformer block:

#   Kernel                           Notes
1   fused_add_rms_norm (pre)         Residual add + RMSNorm fused
2   qkv_proj                         Q, K, V weights concatenated, one matmul
3   rotary_embedding                 Applies RoPE to Q and K in one launch
4   reshape_and_cache_flash          Writes new K, V into the paged KV cache
5   flash_attn_varlen_func (paged)   Fused Q·Kᵀ + softmax + ·V
6   o_proj                           Output projection
7   fused_add_rms_norm (post)        Residual + post-attention RMSNorm fused
8   gate_up_proj                     Gate and up weights merged, one matmul
9   silu_and_mul                     SwiGLU fused: SiLU(gate) * up
10  down_proj                        MLP output projection

Plus 3 global kernels per step (embedding lookup, final RMSNorm, lm_head GEMM).

The cost per model

Pick a model on the right to see the total launch cost:

  • Llama 3 8B (32 layers): 32 × 10 + 3 = 323 kernels/step → 1.6ms CPU stall per token
  • Llama 3 70B (80 layers): ~803 kernels/step → 4.0ms CPU stall
  • Llama 3.1 405B (126 layers): ~1,263 kernels/step → 6.3ms CPU stall
  • DeepSeek-V3 (61 MoE layers, ~16 kernels/block): ~980 kernels/step

For a 30ms TPOT target on Llama 70B, 13% of the entire token budget is CPU launch overhead. And the GPU is idle during most of it.
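The arithmetic behind those numbers, using the 10-kernels-per-block plus 3 global kernels breakdown above and the ~5µs launch figure:

int kernels_per_step(int num_layers) {
    return num_layers * 10 + 3;                 // 10 per transformer block + 3 global
}
double launch_stall_ms(int num_layers) {
    return kernels_per_step(num_layers) * 5.0 / 1000.0;   // ~5µs per launch
}
// launch_stall_ms(32)  ≈ 1.6   (Llama 3 8B)
// launch_stall_ms(80)  ≈ 4.0   (Llama 3 70B) → 4.0ms of a 30ms TPOT budget ≈ 13%
// launch_stall_ms(126) ≈ 6.3   (Llama 3.1 405B)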

This is the gap that CUDA Graphs close.

Capture & Replay

How does CUDA Graph capture work?

A CUDA Graph is built in two phases:

  1. Capture — the program runs the kernel sequence once in a special "recording" mode. The driver doesn't execute the kernels; it records them (with their launch configs and tensor pointers) into a graph data structure.
  2. Replay — the program calls cudaGraphLaunch with the captured graph. The driver dispatches the entire recorded sequence to the GPU as a single command buffer. The CPU is done after one call.

Here's what that looks like in CUDA code:

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Run the kernel sequence once — these don't actually execute,
// they get recorded into the graph.
launch_rms_norm(stream, ...);
launch_qkv_proj(stream, ...);
launch_rotary(stream, ...);
// ... all 803 kernels for Llama 70B ...

cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

// Every decode step after this: one driver call.
for (int step = 0; step < num_tokens; step++) {
    cudaGraphLaunch(exec, stream);
}

What gets captured

  • The op sequence (which kernel runs when)
  • Each kernel's launch config (grid size, block size, shared memory)
  • Each kernel's argument values (pointers to input/output tensors)

What doesn't get captured

  • Host-side control flow (if, for)
  • Memory allocations
  • Dynamic shapes — the tensor pointers and launch configs are frozen

This is why engines pre-allocate persistent I/O buffers and copy inputs into the same addresses every step — if the pointers changed, replay would read the wrong memory. We'll see the consequences in the next step.
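A sketch of the replay-time pattern (the buffer names here are hypothetical): each step's inputs are copied into the same pre-allocated device buffers the graph was captured with, so the pointers frozen inside the graph keep pointing at the right data.

// Copy this step's inputs into the fixed addresses the graph was captured with.
cudaMemcpyAsync(fixed_input_ids, host_input_ids,
                batch_size * sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(fixed_positions, host_positions,
                batch_size * sizeof(int), cudaMemcpyHostToDevice, stream);

// Replay the entire captured kernel sequence with one driver call.
cudaGraphLaunch(exec, stream);

// Read the results back out of the fixed output buffer.
cudaMemcpyAsync(host_logits, fixed_logits,
                batch_size * vocab_size * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);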

Watch the collapse

Try the hero simulation on the right. Play it and compare:

  • Eager mode: CPU fires 12 launches (pink), GPU sits idle (red dashed) between each while waiting for the next command.
  • Graph mode: CPU fires one launch, GPU runs all 12 kernels back-to-back (indigo, no gaps).

At Llama-70B scale (800 kernels × 5µs), the saved time per token is ~4ms — enough to hit a tight TPOT SLO that would otherwise miss.

FlashAttention fuses the attention kernels into one launch. CUDA Graphs fuse all the launches of a decode step into one driver call. They're complementary: graphs still help even when individual kernels are already fused.

The Padding Ladder

Why must CUDA Graphs have fixed shapes?

A captured graph hard-codes every kernel's launch config and tensor pointer. When you replay it, those values are fixed — there's no mechanism to say "same graph but with a different batch size."

If a batch of 7 arrives and we try to replay a graph captured at batch size 4, every kernel reads the wrong amount of data. Correctness breaks.

Engines solve this by capturing a separate graph for each batch size they expect, then padding incoming batches up to the next captured size.

Real production bucket lists

Different engines bucket differently:

vLLM (v1 default) — dense at small sizes, sparser at large:

[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, ..., 256, 272, 288, ..., 512]

Steps of 8 from 8 to 256, steps of 16 beyond. Keeps padding waste low for common small-batch decode.

TensorRT-LLM — pure powers-of-2 plus a few hand-tuned midpoints:

[1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]

Simpler, but batch=17 pads all the way up to 32 — 47% waste on that request.

Try the style toggle on the right and compare avg padding waste.

How padding works

When a batch of 13 arrives to vLLM:

  1. Round up to the next captured size: 16.
  2. Pre-allocated I/O buffers already have 16 slots, so the extra 3 are unused.
  3. Attention-mask metadata is updated so the 3 dummy slots don't attend to anything real.
  4. The size-16 graph replays normally. Dummy-slot outputs are computed but discarded.

Net waste: 3 / 16 ≈ 19% extra compute and memory-bandwidth on that request. vLLM's step-of-8 ladder keeps this under ~7% in aggregate.
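A minimal bucket-selection sketch (illustrative, not vLLM's actual code): round the incoming batch up to the next captured size, or report overflow.

#include <algorithm>
#include <vector>

// captured_sizes must be sorted ascending, e.g. {1, 2, 4, 8, 16, 24, 32, ...}.
// Returns the bucket to pad up to, or -1 if the batch overflows the ladder.
int pick_bucket(int batch, const std::vector<int>& captured_sizes) {
    auto it = std::lower_bound(captured_sizes.begin(), captured_sizes.end(), batch);
    return it == captured_sizes.end() ? -1 : *it;
}
// pick_bucket(13, {1, 2, 4, 8, 16, 24, 32}) == 16 → 3 of 16 slots are padding (~19%)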

What happens on overflow?

When a batch size exceeds the largest captured bucket (e.g., batch=300 with max_num_seqs=256), vLLM falls back to a piecewise (non-graphed) execution path. Attention kernels run as regular torch.compile ops with individual launches. Per-kernel overhead resumes.

This is not the same as --enforce-eager, which disables graphs entirely. Piecewise still has some optimizations (torch.compile graph traces inside each op boundary), just not the full-graph replay.

Bucket granularity is a memory-vs-throughput knob. More buckets = less padding waste, but more GPU memory spent on graph workspaces. vLLM's ~36 default buckets cost roughly 2 GB of workspace on a 7B model.

Production Tradeoffs

What are the tradeoffs of CUDA Graphs in production?

CUDA Graphs aren't free. Enabling them changes three observable behaviors in your serving stack.

1. Startup cost

vLLM captures graphs for every configured batch size at startup, before accepting requests. For each size, it runs 2 warmup iterations (one profiling pass, one capture pass).
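A sketch of that startup loop (helper names like run_decode_step are hypothetical, not vLLM's API): one warmup pass runs eagerly, then a capture pass records the graph for each configured size, and the executable graphs are kept for the lifetime of the server.

#include <map>
#include <vector>

// Assumed helper: run_decode_step() launches the full kernel sequence for one
// decode step at the given batch size.
void capture_all_sizes(cudaStream_t stream, const std::vector<int>& captured_sizes,
                       std::map<int, cudaGraphExec_t>& graphs) {
    for (int bs : captured_sizes) {
        run_decode_step(stream, bs);      // warmup / profiling pass (runs eagerly)

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_decode_step(stream, bs);      // capture pass: recorded, not executed
        cudaGraph_t graph;
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);
        graphs[bs] = exec;                // one executable graph per batch size
    }
}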

Typical numbers for a 7B model with the default bucket list (20–35 sizes):

  • 30–90 seconds of extra startup time
  • Blocks the server's readiness probe

If your Kubernetes deployment has a 60-second readiness timeout, the pod gets killed mid-capture and restart-loops. You either increase the timeout, reduce max_num_seqs to cut the bucket count, or set --enforce-eager to skip capture (accepting slower decode in exchange for fast startup).

2. Memory cost

Each captured graph holds a cudaGraphExec_t plus pre-allocated persistent I/O buffers (input_ids, positions, attn metadata, output logits) sized to its batch. For a 7B model on A100-80GB:

Component                  Size
Model weights (FP16)       ~13 GB
Framework runtime          ~1–2 GB
CUDA Graph workspaces      ~1–3 GB
KV cache (remainder)       ~62–64 GB

The graph memory eats 3–5% of what would otherwise be KV cache. On a 7B model at typical context lengths, that's roughly 5–10 fewer concurrent sequences.

3. The overflow cliff

Requests that exceed max_num_seqs fall to the piecewise (non-graphed) path. Decode TPOT jumps from ~28ms to ~45ms — a 1.4–1.9× cliff on H100.

You'll see this in production as a bimodal latency distribution: most requests are fast, but a subset running during traffic bursts cliff over to the slow path until batch sizes drop back under the captured max.

Try the lifecycle chart on the right. Slide max_num_seqs to see captured count, workspace GB, and resulting KV cache reduction update together.

Tuning knobs

  • --enforce-eager — disables graphs entirely. Slower decode but fast startup. Good for autoscaling, dev iteration, or tiny models where kernels are already compute-bound.
  • cudagraph_capture_sizes — explicit bucket list. Shrinking it trades padding waste for faster startup and less memory.
  • cudagraph_mode — FULL_AND_PIECEWISE (default: full graph for decode-only batches, piecewise for mixed batches), PIECEWISE (attention always runs outside the graph — safer on models with variable-shape attention), FULL (small models only).

If you're running vLLM in an autoscaling deployment where cold-start matters more than steady-state TPOT, --enforce-eager is often the right call. You trade ~1.5× TPOT for ~60s less startup.
