
Prefill/Decode Disaggregation — Visual Guide

The Interference Problem

What is Prefill/Decode Disaggregation?

Prefill/decode disaggregation separates the two phases of LLM inference — prompt processing (prefill) and token generation (decode) — onto different GPU pools. This eliminates interference between the two phases, improving throughput by 2-7x while meeting strict latency targets. DistServe (OSDI 2024) demonstrated that disaggregation can serve 7.4x more requests within latency SLOs, and Splitwise (Microsoft, 2024) showed 2.35x higher throughput at the same cost and power.

The Two Phases

Recall from the Inference Engine module: every LLM request goes through two phases. Prefill processes all prompt tokens in a single forward pass — the GPU reads the model weights once and produces outputs for every token simultaneously. This is compute-bound: the GPU's math units are the bottleneck. Decode generates output tokens one at a time, each requiring a full weight read from memory. This is memory-bandwidth-bound: the GPU spends most of its time loading weights, not computing.

Memory-bandwidth-bound (see GPU & CUDA → roofline-model): a workload limited by how fast the GPU can read data from HBM, not by how fast it can compute. Decode is the canonical example — the model spends most of its time loading weights and KV cache.
[Diagram: prefill vs decode. The prompt "The cat sat on" is prefilled; "the", "mat", "." are decoded one by one.]

Prefill: all prompt tokens processed at once (parallel); KV cache fills up in one shot; GPU does lots of math (compute-bound); fast — GPU is good at parallel work.

Decode: output tokens generated one at a time; each step reads the entire KV cache; GPU mostly loads data (memory-bound); slower — waiting for data, not computing.

Prefill = one big batch (fast) → Decode = one token at a time (slower)

Think of it like a restaurant kitchen. A prep cook (prefill) chops all the vegetables at once — they need counter space and sharp knives (compute). A line cook (decode) plates one dish at a time — they need ingredients within arm's reach (memory bandwidth). Both are essential, but they need different resources.

Prefill vs Decode on the Roofline

[Roofline sketch: decode sits at ~1 FLOP/byte, deep in the memory-bound region; prefill sits at ~100 FLOP/byte, in the compute-bound region.]

Same GPU, fundamentally different bottlenecks
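
Those positions fall out of a back-of-envelope calculation. The sketch below assumes ~2 FLOPs per parameter per token and one full read of FP16 weights per forward pass; the 70B parameter count is illustrative.

```python
# Arithmetic intensity (FLOPs per byte of weights read), assuming a dense
# transformer: ~2 FLOPs per parameter per token, one full weight read per pass.
PARAMS = 70e9          # illustrative model size
BYTES_PER_PARAM = 2    # FP16/BF16 weights

def arithmetic_intensity(tokens_per_pass: int) -> float:
    flops = 2 * PARAMS * tokens_per_pass     # matmul work for the pass
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights are read once either way
    return flops / bytes_moved

print(arithmetic_intensity(1))    # decode: 1 token/pass   -> ~1 FLOP/byte
print(arithmetic_intensity(100))  # prefill: 100 tokens/pass -> ~100 FLOP/byte
```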

Why Mixing Them Hurts

When prefill and decode share one GPU, they fight for resources. A long prompt arrives for prefilling — it monopolizes the GPU's compute units for its entire duration. Meanwhile, all ongoing decode requests stall. They can't generate their next token until the prefill finishes.

This hurts two metrics that users feel directly. TTFT (Time to First Token) is how long a user waits before seeing the first word of the response — a long prefill from another request delays everyone's TTFT. TPOT (Time Per Output Token) is how fast tokens stream after the first one — stalled decode means tokens stop flowing mid-response. Measurements from the Nexus paper show 8-10x slowdown when prefill and decode are mixed in the same batch.

The longer the prompt, the worse the stall. A 4096-token prefill takes ~16ms — that's 16ms where every decoding request produces zero tokens.

On the right panel, click "+ Long (6)" to add a long-prompt request while short requests are decoding. Watch how the decode cells turn to "×" (stalled) — that's the interference in action.

Chunked Prefill

The Same-GPU Fix

The interference problem has a simple scheduling fix: chunked prefill. Instead of processing all 4096 prompt tokens in one giant forward pass, break them into smaller chunks — say, 512 tokens each. After each chunk, give decode requests a turn.

How It Works

A 4096-token prompt becomes 8 chunks of 512 tokens:

  1. Process chunk 1 (512 tokens) — ~2ms
  2. Decode step for all active requests — ~1ms
  3. Process chunk 2 (512 tokens) — ~2ms
  4. Decode step for all active requests — ~1ms
  5. ... repeat until all 8 chunks processed

Without chunking, decode waits 16ms. With chunking, decode runs every 2-3ms. The total prefill time is slightly longer (overhead of re-entering the forward pass), but decode never stalls.
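
A toy scheduler trace makes the two schedules concrete. This is illustrative Python, not a real engine scheduler; one list element is one scheduler step, with the ~2ms chunk and ~1ms decode timings from the example above.

```python
def trace(prompt_chunks: int, chunked: bool) -> list[str]:
    """Schedule one long prefill alongside an already-running decode stream."""
    steps = []
    if chunked:
        for _ in range(prompt_chunks):
            steps.append("P")        # one 512-token chunk (~2 ms)
            steps.append("D")        # decode turn for all active requests (~1 ms)
    else:
        steps += ["P/stall"] * prompt_chunks  # decode produces zero tokens
        steps.append("D")                     # decode resumes only at the end
    return steps

print(trace(8, chunked=False))  # decode waits the full ~16 ms prefill
print(trace(8, chunked=True))   # decode gets a turn every ~3 ms
```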

[Timeline, t0-t9, two requests. Without chunking: Req A stalls (dark) while Req B prefills — TTFT increases. With chunking: Req A decodes every step — no stalling. Legend: Prefill (P), Decode (D), Stalled.]

Real-World Results

vLLM's chunked prefill implementation shows 86% TTFT improvement compared to no chunking. TensorRT-LLM calls this "in-flight batching" — prefill chunks and decode tokens are mixed into the same batch, keeping the GPU busy on every iteration.

The key parameter is chunk size: too large and decode still stalls, too small and per-chunk overhead adds up. Typical values are 256-1024 tokens. vLLM uses max_num_batched_tokens to control this.
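
In vLLM's offline API this is a configuration change, not a code change. A sketch, assuming recent vLLM releases (argument names can shift between versions, so check your version's engine arguments):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    enable_chunked_prefill=True,
    # Token budget per engine iteration: prefill chunks are sized so that
    # chunk tokens plus in-flight decode tokens stay under this cap.
    max_num_batched_tokens=512,
)
```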

Limitations

Chunked prefill doesn't fully eliminate interference. The GPU still switches between compute-heavy (prefill chunks) and memory-heavy (decode) work every few milliseconds. At high QPS, this switching overhead adds up. And you still can't tune the GPU independently for each phase — it's one GPU doing both jobs, just more fairly scheduled.

Toggle chunking OFF and ON in the right panel. With chunking OFF, R1's decode shows "×" (stalled) the entire time R2 prefills. Turn chunking ON — now R2's prefill pauses every few ticks to let R1 decode. Try different chunk sizes: chunk=2 gives R1 a decode turn every 3 ticks, chunk=4 every 5 ticks. Smaller chunks = more responsive decode, but slightly longer total time.

Full Disaggregation

Beyond Chunked Prefill

Chunked prefill is a scheduling trick on one GPU. But what if we could give each phase its own GPU entirely? That's full disaggregation: separate GPU pools for prefill and decode, each optimized for its workload.

The Architecture

In a disaggregated system:

  1. A request arrives → routed to the prefill pool
  2. Prefill GPUs process the entire prompt (no need to chunk — the pool only does prefill)
  3. The resulting KV cache is transferred to the decode pool
  4. Decode GPUs generate output tokens one by one

Each pool runs only one type of work. No interference, no switching overhead.
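
A minimal sketch of that flow, with in-process queues standing in for the two GPU pools. Here run_prefill, kv_transfer, run_decode_step, and emit are hypothetical stand-ins for the engine's actual kernels and transport, not a real API:

```python
from queue import Queue

prefill_queue: Queue = Queue()   # step 1: new requests are routed here
decode_queue: Queue = Queue()    # step 3: (request, KV cache) pairs arrive here

def prefill_worker(run_prefill, kv_transfer):
    while True:
        request = prefill_queue.get()
        kv_cache = run_prefill(request["prompt"])  # step 2: whole prompt, one pass
        kv_transfer(kv_cache)                      # step 3: ship KV to decode pool
        decode_queue.put((request, kv_cache))

def decode_worker(run_decode_step, emit):
    while True:
        request, kv_cache = decode_queue.get()
        for _ in range(request["max_tokens"]):     # step 4: one token per step
            emit(run_decode_step(kv_cache))
```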

Why It's Better

Independent scaling. If prefill is the bottleneck (long prompts, many new requests), add more prefill GPUs. If decode is the bottleneck (many concurrent streams), add more decode GPUs. With a unified setup, you can only add "more of the same."

Hardware specialization. Splitwise (Microsoft Research, 2024) showed that prefill GPUs can be compute-optimized (high FLOPS) while decode GPUs can be memory-bandwidth-optimized (high HBM bandwidth). Different GPU types for different jobs — like using different kitchen equipment for prep vs plating.

Independent parallelism tuning. DistServe (OSDI 2024) found that prefill benefits from tensor parallelism (splitting matrix multiplications across GPUs for speed) while decode benefits from pipeline parallelism (splitting layers across GPUs for throughput). You can't have both on the same GPU.

Real-World Results

System | Improvement | Metric
DistServe | 7.4x | Requests served within latency SLO
Splitwise | 2.35x | Throughput at same cost/power
TetriInfer | 97% | TTFT improvement

Who Uses It

Nearly all production LLM serving frameworks now support disaggregation: vLLM (experimental), SGLang, Ray Serve, TensorRT-LLM, and NVIDIA Dynamo.

On the right panel, watch requests flow through two separate GPU pools. Neither pool ever stalls — the prefill pool is always prefilling, the decode pool is always decoding. Click "+ Request" to add more traffic.

The KV Cache Transfer

The Cost of Disaggregation

Disaggregation eliminates interference, but it introduces a new cost: the KV cache must move from the prefill GPU to the decode GPU. This transfer takes time, and if it's too slow, the decode GPU sits idle waiting.

How Large Is the KV Cache?

The KV cache stores key and value tensors for every layer and every token. Its size follows a simple formula:

KV cache = 2 × layers × KV heads × head_dim × sequence_length × bytes_per_value

The factor of 2 accounts for both K and V. For models using GQA (grouped-query attention), the number of KV heads is smaller than the number of attention heads — Llama 3.1 70B uses 8 KV heads instead of 64 attention heads, which shrinks the cache by 8x.

KV cache size (Llama 3.1 70B, 8192 tokens)

2 (K+V) × 80 (layers) × 8 (KV heads) × 128 (head_dim) × 8192 (tokens) × 2 (bytes/val) ≈ 2.7 GB

Transfer time (~2.7 GB KV cache)

PCIe 4.0: ~84 ms
InfiniBand NDR: ~54 ms
NVLink: ~3.0 ms

NVLink is ~28× faster than PCIe — same-node transfers are near-free.

Interconnect Matters

The transfer time depends on the interconnect between the prefill and decode GPUs:

Interconnect | Bandwidth | Use Case
PCIe 4.0 | 32 GB/s | Cross-node (common but slow)
InfiniBand NDR | 50 GB/s | Cross-node (data center standard)
NVLink | 900 GB/s | Intra-node (same server, very fast)

For Llama 3.1 70B at 4096 tokens (~1.3 GB of KV cache), transfer takes ~42ms over PCIe, ~27ms over InfiniBand, or ~1.5ms over NVLink. A typical decode step takes ~10ms — so PCIe and InfiniBand create idle time, while NVLink is fast enough to hide the transfer.
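
Both the cache sizes and the transfer times above fall straight out of the formula and the bandwidth table; a quick check in Python:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val  # K and V

BANDWIDTH_GB_S = {"PCIe 4.0": 32, "InfiniBand NDR": 50, "NVLink": 900}

for seq_len in (8192, 4096):  # the two worked examples above
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9  # Llama 3.1 70B shapes
    for link, bw in BANDWIDTH_GB_S.items():
        print(f"{seq_len} tokens ({gb:.2f} GB) over {link}: {gb / bw * 1e3:.1f} ms")
# 8192 tokens (2.68 GB): ~84 / ~54 / ~3.0 ms
# 4096 tokens (1.34 GB): ~42 / ~27 / ~1.5 ms
```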

Reducing Transfer Cost

Chunked transfer: Start sending KV cache while prefill is still running. After the first chunk of tokens is prefilled, send those KV entries immediately. By the time prefill finishes, most of the cache is already at the decode GPU.
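
A sketch of that overlap, assuming a chunk-wise engine: prefill_chunk and send_kv are hypothetical stand-ins for the attention kernel and the interconnect send, and the background thread plays the role of the transfer stream.

```python
import threading
from queue import Queue

def prefill_with_streaming_kv(prompt_chunks, prefill_chunk, send_kv):
    outbox: Queue = Queue()

    def sender():
        # Ship each KV chunk as soon as it exists, overlapping the transfer
        # with the prefill compute still running on the main thread.
        while (kv := outbox.get()) is not None:
            send_kv(kv)

    t = threading.Thread(target=sender)
    t.start()
    for chunk in prompt_chunks:
        outbox.put(prefill_chunk(chunk))  # this chunk's KV starts moving now
    outbox.put(None)                      # sentinel: prefill is done
    t.join()                              # most of the cache is already across
```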

Placement-aware scheduling: DistServe places prefill and decode pools based on cluster bandwidth topology. If two GPUs share an NVLink connection, they're paired as prefill/decode. Cross-node pairs use InfiniBand.

KV cache compression: Quantize KV entries from FP16/BF16 to INT8 or even INT4 before transfer, then dequantize on the decode side. INT8 halves the transfer size (INT4 quarters it) at the cost of minor accuracy impact.
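
For illustration, a minimal per-tensor symmetric INT8 quantizer in NumPy. Real systems typically keep per-head or per-channel scales; the shapes here are toy values:

```python
import numpy as np

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(kv).max()) / 127.0               # one scale per tensor
    q = np.round(kv.astype(np.float32) / scale).clip(-127, 127)
    return q.astype(np.int8), scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float16) * np.float16(scale)       # decode-side restore

kv = np.random.randn(80, 8, 256, 128).astype(np.float16)  # (layers, heads, tokens, dim)
q, scale = quantize_kv(kv)
print(kv.nbytes / q.nbytes)                               # 2.0: half the payload
```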

On the right panel, drag the sequence length slider and switch between models. Watch how KV cache size and transfer time grow. Try different interconnects — notice how NVLink makes the bottleneck warning disappear.

When to Use What

Choosing the Right Approach

Not every deployment needs disaggregation. The right choice depends on your scale, hardware, and latency requirements.

Decision Framework

Approach | When to Use | Complexity
Unified | Low QPS, short prompts, single GPU | None
Chunked Prefill | Medium QPS, mixed prompt lengths, single node | Low (config change)
Full Disaggregation | High QPS, strict SLOs, multi-node cluster | High (infra change)
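
The same table as a rule of thumb in code. The thresholds are the rough figures from this section, not universal constants; real deployments should profile:

```python
def recommend(qps: float, ttft_slo_ms: float, multi_node: bool) -> str:
    if qps < 5:
        return "unified"                  # interference is negligible
    if qps >= 50 and ttft_slo_ms < 200 and multi_node:
        return "full disaggregation"      # also needs a fast interconnect
    return "chunked prefill"              # one config flag, single node

print(recommend(qps=2, ttft_slo_ms=1000, multi_node=False))  # unified
print(recommend(qps=80, ttft_slo_ms=150, multi_node=True))   # full disaggregation
```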

The Key Metrics

TTFT (Time to First Token): How long until the user sees the first output token. Chunked prefill improves this dramatically (86% in vLLM) because decode isn't blocked by long prefills. Disaggregation improves it further by eliminating all interference.

TPOT (Time Per Output Token): How fast tokens stream after the first one. Chunked prefill helps slightly (decode gets regular GPU time). Disaggregation helps more — the decode pool runs uninterrupted.

Throughput: Total tokens per second across all requests. Disaggregation wins at scale because each pool operates at peak efficiency for its workload type. Splitwise showed 2.35x throughput improvement at the same cost and power budget.
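
Both latency metrics follow directly from per-token arrival timestamps. A minimal sketch (timestamps in seconds, values illustrative):

```python
def ttft(request_start: float, token_times: list[float]) -> float:
    return token_times[0] - request_start        # wait until the first token

def tpot(token_times: list[float]) -> float:
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)                 # mean gap after the first token

times = [0.35, 0.38, 0.41, 0.44]
print(f"TTFT: {ttft(0.0, times) * 1e3:.0f} ms")  # 350 ms
print(f"TPOT: {tpot(times) * 1e3:.0f} ms")       # 30 ms
```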

When Unified Is Fine

If you're running a single GPU serving a chatbot at low QPS (< 5 requests/second) with prompts under 1K tokens, the interference is minimal. Adding chunked prefill won't noticeably improve user experience, and disaggregation would be over-engineering.

When Chunked Prefill Suffices

Medium traffic (5-50 QPS) with occasionally long prompts. Enable chunked prefill in vLLM with --enable-chunked-prefill — it's a configuration flag, not an infrastructure change. Most single-node deployments should start here.

When to Disaggregate

High traffic (50+ QPS) with strict latency SLOs (< 200ms TTFT) and a multi-node GPU cluster. You need fast interconnect (NVLink or InfiniBand) between prefill and decode pools. The infrastructure complexity is significant — separate scheduling, KV cache transfer, pool management — but the throughput and latency gains are substantial.

On the right panel, adjust QPS, prompt length, and SLO strictness. Watch how the recommended approach changes. At low QPS, unified is fine. Crank up QPS with a strict SLO and disaggregation becomes the only option that meets the target.
