
CUDA Graphs for LLM Decode

Kernel Launch Overhead

The two function calls this module is about

A kernel is a piece of code that runs on the GPU — a matmul, a RMSNorm, a softmax. To run one, the CPU has to tell the GPU "go do this." That instruction is a function call with a specific name:

  • cudaLaunchKernel — "Hey GPU, run this one kernel." The CPU makes this call every single time it wants a kernel executed. If a decode step has 800 kernels, the CPU makes 800 separate calls. Each call takes about 5µs of CPU work before the GPU actually starts computing.
  • cudaGraphLaunch — "Hey GPU, run this entire pre-recorded sequence." The CPU makes this call once and the GPU executes all 800 kernels back-to-back. One call, not 800.

That's the whole trick CUDA Graphs play: replace many cudaLaunchKernel calls with one cudaGraphLaunch. The rest of this module is about when that trick is worth it, how the recording works, and what it costs in practice.
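To make the eager side concrete, here is a minimal sketch (the kernel and names are illustrative, not vLLM's): a trivial kernel launched in a loop, where every <<<...>>> launch compiles down to a cudaLaunchKernel call and pays the fixed dispatch cost no matter how little work the kernel does.

// A deliberately tiny kernel: almost no compute, so launch overhead dominates.
__global__ void tiny_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

// Eager mode: one cudaLaunchKernel per kernel, num_kernels times per decode step.
// d_x must point to at least 32 device floats.
void eager_decode_step(float* d_x, cudaStream_t stream, int num_kernels) {
    for (int i = 0; i < num_kernels; i++) {
        tiny_kernel<<<1, 32, 0, stream>>>(d_x);   // ~5µs of CPU work per launch
    }
    cudaStreamSynchronize(stream);
}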

What is a CUDA Graph?

A CUDA Graph is a recording of a kernel sequence. You run the sequence once in "record mode" so the driver captures the order, the kernel configurations, and the memory addresses into a graph data structure. After that, calling cudaGraphLaunch(graph) replays the whole recording as a single driver command. Think of it like a macro for GPU work.

The small-kernel problem

Every cudaLaunchKernel call costs ~5µs of CPU work on an A100 — argument marshaling, driver submission, GPU front-end handshake. That cost — Kernel launch overhead — is constant regardless of how much compute the kernel does.

Kernel launch overhead (GPU & CUDA track → operator-fusion): the fixed CPU-side cost (~5–20 µs) of dispatching a single GPU kernel. When kernels run for less time than the launch cost, the GPU sits idle waiting for the next launch.

For a big matmul (100µs of compute), 5µs of overhead is 5% — negligible. For a tiny kernel (2µs of compute), 5µs of overhead is 71% — the CPU spends most of its time making phone calls instead of getting work done.
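The ratio is just launch cost over total time. Here it is as a tiny helper, using the ~5µs A100 figure from above:

double overhead_fraction(double launch_us, double compute_us) {
    return launch_us / (launch_us + compute_us);   // share of wall time spent launching
}
// overhead_fraction(5.0, 100.0) ≈ 0.048 → ~5%   (big matmul)
// overhead_fraction(5.0, 2.0)   ≈ 0.714 → ~71%  (tiny decode kernel)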

Each cudaLaunchKernel call is like a phone call from CPU to GPU — 5µs of overhead. For a 100µs task, that's fine. For a 2µs task, you're mostly on hold.

Try the kernel-size toggle on the right. Flip between "tiny" (2µs compute) and "big" (100µs compute) and watch the overhead-to-compute ratio. The 5µs launch cost doesn't change — only the compute block does. This is the key setup for the next step: decode kernels are tiny.

Hardware variance: 5µs is typical for A100 with warm drivers. H100 gets to sub-2µs in many paths; V100 was closer to 8µs. The ratio matters more than the absolute number.

Next: we'll see why a single Llama decode step fires 300–1,300 kernels, and why every one of them is in the "tiny" regime.

Why Decode Is the Worst Case

Prefill vs decode, revisited

Recall from the KV Cache and Prefill/Decode modules: prefill and decode have very different kernel shapes.

Prefill (prompt "The cat sat on", processed at once):
  • All prompt tokens processed at once (parallel)
  • KV cache fills up in one shot
  • GPU does lots of math (compute-bound)
  • Fast — GPU is good at parallel work

Decode (generates "the" → "mat" → ".", one token at a time):
  • Output tokens generated one at a time
  • Each step reads entire KV cache
  • GPU mostly loads data (memory-bound)
  • Slower — waiting for data, not computing

Prefill = one big batch (fast) → Decode = one token at a time (slower)

Prefill runs matmuls over the entire prompt at once — a few large kernels that each do tens to hundreds of µs of compute. Launch overhead is noise.

Decode runs matmuls over a single token per step — the same kernels, but with tiny inputs. Each kernel drops to 2–20µs of compute. Now the 5µs launch cost is a significant fraction.

Anatomy of one decode step

Before the kernel list, a quick aside on operator fusion. Naively, a transformer block has lots of tiny operations — residual-add, RMSNorm, SiLU, multiply, matmul. If each ran as its own CUDA kernel you'd get 20+ kernels per layer. Fusion means merging several ops into a single kernel — for example, fused_add_rms_norm does residual-add and RMSNorm in one launch using the same GPU registers. Fewer kernels = fewer launches = less HBM traffic.

Unfused (3 kernels):
HBM (read) → matmul → HBM (write+read) → bias add → HBM (write+read) → ReLU → HBM (write)
6 HBM accesses

vs

Fused (1 kernel):
HBM (read) → matmul + bias + ReLU → HBM (write)
2 HBM accesses

3× fewer HBM accesses — same computation

The diagram shows the pattern: matmul → bias add → ReLU run as 3 separate kernels, writing intermediate results to HBM between every step. Fused into one kernel, the intermediates stay in registers and only the final result hits HBM. (The GPU/CUDA track has a full module on this.)
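A minimal CUDA sketch of the idea (illustrative kernels, not vLLM's; the matmul is left out and the bias is treated as elementwise to keep it short): the unfused path launches bias-add and ReLU separately, with an HBM round trip between them, while the fused kernel applies both in registers before a single write.

// Unfused path: two launches, the intermediate result goes out to HBM and back.
__global__ void bias_add(float* y, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += bias[i];
}
__global__ void relu(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(y[i], 0.0f);
}

// Fused: one launch, bias + ReLU happen in registers, one read and one write.
__global__ void bias_add_relu(float* y, const float* bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(y[i] + bias[i], 0.0f);
}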

Post-fusion means "after vLLM has applied its fusion optimizations." Post-fusion, vLLM runs exactly 10 CUDA kernels per Llama transformer block:

#   Kernel                           Notes
1   fused_add_rms_norm (pre)         Residual add + RMSNorm fused
2   qkv_proj                         Q, K, V weights concatenated, one matmul
3   rotary_embedding                 Applies RoPE to Q and K in one launch
4   reshape_and_cache_flash          Writes new K, V into the paged KV cache
5   flash_attn_varlen_func (paged)   Fused Q·Kᵀ + softmax + ·V
6   o_proj                           Output projection
7   fused_add_rms_norm (post)        Residual + post-attention RMSNorm fused
8   gate_up_proj                     Gate and up weights merged, one matmul
9   silu_and_mul                     SwiGLU fused: SiLU(gate) * up
10  down_proj                        MLP output projection

Plus 3 global kernels per step (embedding lookup, final RMSNorm, lm_head GEMM).

The cost per model

Pick a model on the right to see the total launch cost:

  • Llama 3 8B (32 layers): 32 × 10 + 3 = 323 kernels/step → 1.6ms CPU stall per token
  • Llama 3 70B (80 layers): ~803 kernels/step → 4.0ms CPU stall
  • Llama 3.1 405B (126 layers): ~1,263 kernels/step → 6.3ms CPU stall
  • DeepSeek-V3 (61 MoE layers, ~16 kernels/block): ~980 kernels/step

For a 30ms TPOT target on Llama 70B, 13% of the entire token budget is CPU launch overhead. And the GPU is idle during most of it.
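The arithmetic behind those numbers, using the 10-kernels-per-block plus 3 global kernels breakdown above and the ~5µs launch figure:

int kernels_per_step(int num_layers) {
    return num_layers * 10 + 3;                 // 10 per transformer block + 3 global
}
double launch_stall_ms(int num_layers) {
    return kernels_per_step(num_layers) * 5.0 / 1000.0;   // ~5µs per launch
}
// launch_stall_ms(32)  ≈ 1.6   (Llama 3 8B)
// launch_stall_ms(80)  ≈ 4.0   (Llama 3 70B) → 4.0ms of a 30ms TPOT budget ≈ 13%
// launch_stall_ms(126) ≈ 6.3   (Llama 3.1 405B)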

This is the gap that CUDA Graphs close.

Capture & Replay

How does CUDA Graph capture work?

A CUDA Graph is built in two phases:

  1. Capture — the program runs the kernel sequence once in a special "recording" mode. The driver doesn't execute the kernels; it records them (with their launch configs and tensor pointers) into a graph data structure.
  2. Replay — the program calls cudaGraphLaunch with the captured graph. The driver dispatches the entire recorded sequence to the GPU as a single command buffer. The CPU is done after one call.

Here's what that looks like in CUDA code:

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Run the kernel sequence once — these don't actually execute,
// they get recorded into the graph.
launch_rms_norm(stream, ...);
launch_qkv_proj(stream, ...);
launch_rotary(stream, ...);
// ... all 803 kernels for Llama 70B ...

cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

// Every decode step after this: one driver call.
for (int step = 0; step < num_tokens; step++) {
    cudaGraphLaunch(exec, stream);
}

What gets captured

  • The op sequence (which kernel runs when)
  • Each kernel's launch config (grid size, block size, shared memory)
  • Each kernel's argument values (pointers to input/output tensors)

What doesn't get captured

  • Host-side control flow (if, for)
  • Memory allocations
  • Dynamic shapes — the tensor pointers and launch configs are frozen

This is why engines pre-allocate persistent I/O buffers and copy inputs into the same addresses every step — if the pointers changed, replay would read the wrong memory. We'll see the consequences in the next step.
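A sketch of the replay-time pattern (the buffer names here are hypothetical): each step's inputs are copied into the same pre-allocated device buffers the graph was captured with, so the pointers frozen inside the graph keep pointing at the right data.

// Copy this step's inputs into the fixed addresses the graph was captured with.
cudaMemcpyAsync(fixed_input_ids, host_input_ids,
                batch_size * sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(fixed_positions, host_positions,
                batch_size * sizeof(int), cudaMemcpyHostToDevice, stream);

// Replay the entire captured kernel sequence with one driver call.
cudaGraphLaunch(exec, stream);

// Read the results back out of the fixed output buffer.
cudaMemcpyAsync(host_logits, fixed_logits,
                batch_size * vocab_size * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);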

Watch the collapse

Try the hero simulation on the right. Play it and compare:

  • Eager mode: CPU fires 12 launches (pink), GPU sits idle (red dashed) between each while waiting for the next command.
  • Graph mode: CPU fires one launch, GPU runs all 12 kernels back-to-back (indigo, no gaps).

At Llama-70B scale (800 kernels × 5µs), the saved time per token is ~4ms — enough to hit a tight TPOT SLO that would otherwise miss.

FlashAttention fuses the attention kernels into one launch. CUDA Graphs fuse all the launches of a decode step into one driver call. They're complementary: graphs still help even when individual kernels are already fused.

The Padding Ladder

Why must CUDA Graphs have fixed shapes?

A captured graph hard-codes every kernel's launch config and tensor pointer. When you replay it, those values are fixed — there's no mechanism to say "same graph but with a different batch size."

If a batch of 7 arrives and we try to replay a graph captured at batch size 4, every kernel reads the wrong amount of data. Correctness breaks.

Engines solve this by capturing a separate graph for each batch size they expect, then padding incoming batches up to the next captured size.

Real production bucket lists

Different engines bucket differently:

vLLM (v1 default) — dense at small sizes, sparser at large:

[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, ..., 256, 272, 288, ..., 512]

Steps of 8 from 8 to 256, steps of 16 beyond. Keeps padding waste low for common small-batch decode.

TensorRT-LLM — pure powers-of-2 plus a few hand-tuned midpoints:

[1, 2, 4, 8, 16, 32, 64, 128, 256, 384, 512, 1024, 2048, 4096, 8192]

Simpler, but batch=17 pads all the way up to 32 — 47% waste on that request.

Try the style toggle on the right and compare avg padding waste.

How padding works

When a batch of 13 arrives to vLLM:

  1. Round up to the next captured size: 16.
  2. Pre-allocated I/O buffers already have 16 slots, so the extra 3 are unused.
  3. Attention-mask metadata is updated so the 3 dummy slots don't attend to anything real.
  4. The size-16 graph replays normally. Dummy-slot outputs are computed but discarded.

Net waste: 3 / 16 ≈ 19% extra compute and memory-bandwidth on that request. vLLM's step-of-8 ladder keeps this under ~7% in aggregate.
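A minimal bucket-selection sketch (illustrative, not vLLM's actual code): round the incoming batch up to the next captured size, or report overflow.

#include <algorithm>
#include <vector>

// captured_sizes must be sorted ascending, e.g. {1, 2, 4, 8, 16, 24, 32, ...}.
// Returns the bucket to pad up to, or -1 if the batch overflows the ladder.
int pick_bucket(int batch, const std::vector<int>& captured_sizes) {
    auto it = std::lower_bound(captured_sizes.begin(), captured_sizes.end(), batch);
    return it == captured_sizes.end() ? -1 : *it;
}
// pick_bucket(13, {1, 2, 4, 8, 16, 24, 32}) == 16 → 3 of 16 slots are padding (~19%)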

What happens on overflow?

When a batch size exceeds the largest captured bucket (e.g., batch=300 with max_num_seqs=256), vLLM falls back to a piecewise (non-graphed) execution path. Attention kernels run as regular torch.compile ops with individual launches. Per-kernel overhead resumes.

This is not the same as --enforce-eager, which disables graphs entirely. Piecewise still has some optimizations (torch.compile graph traces inside each op boundary), just not the full-graph replay.

Bucket granularity is a memory-vs-throughput knob. More buckets = less padding waste, but more GPU memory spent on graph workspaces. vLLM's ~36 default buckets cost roughly 2 GB of workspace on a 7B model.

Production Tradeoffs

What are the tradeoffs of CUDA Graphs in production?

CUDA Graphs aren't free. Enabling them changes three observable behaviors in your serving stack.

1. Startup cost

vLLM captures graphs for every configured batch size at startup, before accepting requests. For each size, it runs 2 warmup iterations (one profiling pass, one capture pass).
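A sketch of that startup loop (helper names like run_decode_step are hypothetical, not vLLM's API): one warmup pass runs eagerly, then a capture pass records the graph for each configured size, and the executable graphs are kept for the lifetime of the server.

#include <map>
#include <vector>

// Assumed helper: run_decode_step() launches the full kernel sequence for one
// decode step at the given batch size.
void capture_all_sizes(cudaStream_t stream, const std::vector<int>& captured_sizes,
                       std::map<int, cudaGraphExec_t>& graphs) {
    for (int bs : captured_sizes) {
        run_decode_step(stream, bs);      // warmup / profiling pass (runs eagerly)

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        run_decode_step(stream, bs);      // capture pass: recorded, not executed
        cudaGraph_t graph;
        cudaStreamEndCapture(stream, &graph);

        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);
        graphs[bs] = exec;                // one executable graph per batch size
    }
}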

Typical numbers for a 7B model with the default bucket list (20–35 sizes):

  • 30–90 seconds of extra startup time
  • Blocks the server's readiness probe

If your Kubernetes deployment has a 60-second readiness timeout, the pod gets killed mid-capture and restart-loops. You either increase the timeout, reduce max_num_seqs to cut the bucket count, or set --enforce-eager to skip capture (accepting slower decode in exchange for fast startup).

2. Memory cost

Each captured graph holds a cudaGraphExec_t plus pre-allocated persistent I/O buffers (input_ids, positions, attn metadata, output logits) sized to its batch. For a 7B model on A100-80GB:

Component                  Size
Model weights (FP16)       ~13 GB
Framework runtime          ~1–2 GB
CUDA Graph workspaces      ~1–3 GB
KV cache (remainder)       ~62–64 GB

The graph memory eats 3–5% of what would otherwise be KV cache. On a 7B model at typical context lengths, that's roughly 5–10 fewer concurrent sequences.

3. The overflow cliff

Requests that exceed max_num_seqs fall to the piecewise (non-graphed) path. Decode TPOT jumps from ~28ms to ~45ms — a 1.4–1.9× cliff on H100.

You'll see this in production as a bimodal latency distribution: most requests are fast, but a subset running during traffic bursts cliff over to the slow path until batch sizes drop back under the captured max.

Try the lifecycle chart on the right. Slide max_num_seqs to see captured count, workspace GB, and resulting KV cache reduction update together.

Tuning knobs

  • --enforce-eager — disables graphs entirely. Slower decode but fast startup. Good for autoscaling, dev iteration, or tiny models where kernels are already compute-bound.
  • cudagraph_capture_sizes — explicit bucket list. Shrinking it trades padding waste for faster startup and less memory.
  • cudagraph_mode — FULL_AND_PIECEWISE (default: full graph for decode-only batches, piecewise for mixed batches), PIECEWISE (attention always runs outside the graph — safer on models with variable-shape attention), FULL (small models only).

If you're running vLLM in an autoscaling deployment where cold-start matters more than steady-state TPOT, --enforce-eager is often the right call. You trade ~1.5× TPOT for ~60s less startup.
