
Prefill/Decode Disaggregation — Visual Guide

The Interference Problem

What is Prefill/Decode Disaggregation?

Prefill/decode disaggregation separates the two phases of LLM inference — prompt processing (prefill) and token generation (decode) — onto different GPU pools. This eliminates interference between the two phases, improving throughput by 2-7x while meeting strict latency targets. DistServe (OSDI 2024) demonstrated that disaggregation can serve 7.4x more requests within latency SLOs, and Splitwise (Microsoft, 2024) showed 2.35x higher throughput at the same cost and power.

The Two Phases

Recall from the Inference Engine module: every LLM request goes through two phases. Prefill processes all prompt tokens in a single forward pass — the GPU reads the model weights once and produces outputs for every token simultaneously. This is compute-bound: the GPU's math units are the bottleneck. Decode generates output tokens one at a time, each requiring a full weight read from memory. This is memory-bandwidth-bound: the GPU spends most of its time loading weights, not computing.

Memory-bandwidth-bound (see GPU & CUDA → roofline-model): a workload limited by how fast the GPU can read data from HBM, not by how fast it can compute. Decode is the canonical example — the model spends most of its time loading weights and KV cache.
[Diagram: prefill vs decode. The prompt "The cat sat on" is prefilled; "the", "mat", "." are decoded one by one.]

Prefill: all prompt tokens processed at once (parallel); KV cache fills up in one shot; GPU does lots of math (compute-bound); fast — GPU is good at parallel work.

Decode: output tokens generated one at a time; each step reads the entire KV cache; GPU mostly loads data (memory-bound); slower — waiting for data, not computing.

Prefill = one big batch (fast) → Decode = one token at a time (slower)

Think of it like a restaurant kitchen. A prep cook (prefill) chops all the vegetables at once — they need counter space and sharp knives (compute). A line cook (decode) plates one dish at a time — they need ingredients within arm's reach (memory bandwidth). Both are essential, but they need different resources.

Prefill vs Decode on the Roofline

[Roofline sketch: decode sits at ~1 FLOP/byte, deep in the memory-bound region; prefill sits at ~100 FLOP/byte, in the compute-bound region.]

Same GPU, fundamentally different bottlenecks
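
Those positions fall out of a back-of-envelope calculation. The sketch below assumes ~2 FLOPs per parameter per token and one full read of FP16 weights per forward pass; the 70B parameter count is illustrative.

```python
# Arithmetic intensity (FLOPs per byte of weights read), assuming a dense
# transformer: ~2 FLOPs per parameter per token, one full weight read per pass.
PARAMS = 70e9          # illustrative model size
BYTES_PER_PARAM = 2    # FP16/BF16 weights

def arithmetic_intensity(tokens_per_pass: int) -> float:
    flops = 2 * PARAMS * tokens_per_pass     # matmul work for the pass
    bytes_moved = PARAMS * BYTES_PER_PARAM   # weights are read once either way
    return flops / bytes_moved

print(arithmetic_intensity(1))    # decode: 1 token/pass   -> ~1 FLOP/byte
print(arithmetic_intensity(100))  # prefill: 100 tokens/pass -> ~100 FLOP/byte
```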

Why Mixing Them Hurts

When prefill and decode share one GPU, they fight for resources. A long prompt arrives for prefilling — it monopolizes the GPU's compute units for its entire duration. Meanwhile, all ongoing decode requests stall. They can't generate their next token until the prefill finishes.

This hurts two metrics that users feel directly. TTFT (Time to First Token) is how long a user waits before seeing the first word of the response — a long prefill from another request delays everyone's TTFT. TPOT (Time Per Output Token) is how fast tokens stream after the first one — stalled decode means tokens stop flowing mid-response. Measurements from the Nexus paper show 8-10x slowdown when prefill and decode are mixed in the same batch.

The longer the prompt, the worse the stall. A 4096-token prefill takes ~16ms — that's 16ms where every decoding request produces zero tokens.

On the right panel, click "+ Long (6)" to add a long-prompt request while short requests are decoding. Watch how the decode cells turn to "×" (stalled) — that's the interference in action.

Chunked Prefill

The Same-GPU Fix

The interference problem has a simple scheduling fix: chunked prefill. Instead of processing all 4096 prompt tokens in one giant forward pass, break them into smaller chunks — say, 512 tokens each. After each chunk, give decode requests a turn.

How It Works

A 4096-token prompt becomes 8 chunks of 512 tokens:

  1. Process chunk 1 (512 tokens) — ~2ms
  2. Decode step for all active requests — ~1ms
  3. Process chunk 2 (512 tokens) — ~2ms
  4. Decode step for all active requests — ~1ms
  5. ... repeat until all 8 chunks processed

Without chunking, decode waits 16ms. With chunking, decode runs every 2-3ms. The total prefill time is slightly longer (overhead of re-entering the forward pass), but decode never stalls.
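
A toy scheduler trace makes the two schedules concrete. This is illustrative Python, not a real engine scheduler; one list element is one scheduler step, with the ~2ms chunk and ~1ms decode timings from the example above.

```python
def trace(prompt_chunks: int, chunked: bool) -> list[str]:
    """Schedule one long prefill alongside an already-running decode stream."""
    steps = []
    if chunked:
        for _ in range(prompt_chunks):
            steps.append("P")        # one 512-token chunk (~2 ms)
            steps.append("D")        # decode turn for all active requests (~1 ms)
    else:
        steps += ["P/stall"] * prompt_chunks  # decode produces zero tokens
        steps.append("D")                     # decode resumes only at the end
    return steps

print(trace(8, chunked=False))  # decode waits the full ~16 ms prefill
print(trace(8, chunked=True))   # decode gets a turn every ~3 ms
```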

[Timeline, t0-t9, two requests. Without chunking: Req A stalls (dark) while Req B prefills — TTFT increases. With chunking: Req A decodes every step — no stalling. Legend: Prefill (P), Decode (D), Stalled.]

Real-World Results

vLLM's chunked prefill implementation shows 86% TTFT improvement compared to no chunking. TensorRT-LLM calls this "in-flight batching" — prefill chunks and decode tokens are mixed into the same batch, keeping the GPU busy on every iteration.

The key parameter is chunk size: too large and decode still stalls, too small and per-chunk overhead adds up. Typical values are 256-1024 tokens. vLLM uses max_num_batched_tokens to control this.
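
In vLLM's offline API this is a configuration change, not a code change. A sketch, assuming recent vLLM releases (argument names can shift between versions, so check your version's engine arguments):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    enable_chunked_prefill=True,
    # Token budget per engine iteration: prefill chunks are sized so that
    # chunk tokens plus in-flight decode tokens stay under this cap.
    max_num_batched_tokens=512,
)
```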

Limitations

Chunked prefill doesn't fully eliminate interference. The GPU still switches between compute-heavy (prefill chunks) and memory-heavy (decode) work every few milliseconds. At high QPS, this switching overhead adds up. And you still can't tune the GPU independently for each phase — it's one GPU doing both jobs, just more fairly scheduled.

Toggle chunking OFF and ON in the right panel. With chunking OFF, R1's decode shows "×" (stalled) the entire time R2 prefills. Turn chunking ON — now R2's prefill pauses every few ticks to let R1 decode. Try different chunk sizes: chunk=2 gives R1 a decode turn every 3 ticks, chunk=4 every 5 ticks. Smaller chunks = more responsive decode, but slightly longer total time.

Full Disaggregation

Beyond Chunked Prefill

Chunked prefill is a scheduling trick on one GPU. But what if we could give each phase its own GPU entirely? That's full disaggregation: separate GPU pools for prefill and decode, each optimized for its workload.

The Architecture

In a disaggregated system:

  1. A request arrives → routed to the prefill pool
  2. Prefill GPUs process the entire prompt (no need to chunk — the pool only does prefill)
  3. The resulting KV cache is transferred to the decode pool
  4. Decode GPUs generate output tokens one by one

Each pool runs only one type of work. No interference, no switching overhead.
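
A minimal sketch of that flow, with in-process queues standing in for the two GPU pools. Here run_prefill, kv_transfer, run_decode_step, and emit are hypothetical stand-ins for the engine's actual kernels and transport, not a real API:

```python
from queue import Queue

prefill_queue: Queue = Queue()   # step 1: new requests are routed here
decode_queue: Queue = Queue()    # step 3: (request, KV cache) pairs arrive here

def prefill_worker(run_prefill, kv_transfer):
    while True:
        request = prefill_queue.get()
        kv_cache = run_prefill(request["prompt"])  # step 2: whole prompt, one pass
        kv_transfer(kv_cache)                      # step 3: ship KV to decode pool
        decode_queue.put((request, kv_cache))

def decode_worker(run_decode_step, emit):
    while True:
        request, kv_cache = decode_queue.get()
        for _ in range(request["max_tokens"]):     # step 4: one token per step
            emit(run_decode_step(kv_cache))
```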

Why It's Better

Independent scaling. If prefill is the bottleneck (long prompts, many new requests), add more prefill GPUs. If decode is the bottleneck (many concurrent streams), add more decode GPUs. With a unified setup, you can only add "more of the same."

Hardware specialization. Splitwise (Microsoft Research, 2024) showed that prefill GPUs can be compute-optimized (high FLOPS) while decode GPUs can be memory-bandwidth-optimized (high HBM bandwidth). Different GPU types for different jobs — like using different kitchen equipment for prep vs plating.

Independent parallelism tuning. DistServe (OSDI 2024) found that prefill benefits from tensor parallelism (splitting matrix multiplications across GPUs for speed) while decode benefits from pipeline parallelism (splitting layers across GPUs for throughput). You can't have both on the same GPU.

Real-World Results

System | Improvement | Metric
DistServe | 7.4x | Requests served within latency SLO
Splitwise | 2.35x | Throughput at same cost/power
TetriInfer | 97% | TTFT improvement

Who Uses It

Nearly all production LLM serving frameworks now support disaggregation: vLLM (experimental), SGLang, Ray Serve, TensorRT-LLM, and NVIDIA Dynamo.

On the right panel, watch requests flow through two separate GPU pools. Neither pool ever stalls — the prefill pool is always prefilling, the decode pool is always decoding. Click "+ Request" to add more traffic.

The KV Cache Transfer

The Cost of Disaggregation

Disaggregation eliminates interference, but it introduces a new cost: the KV cache must move from the prefill GPU to the decode GPU. This transfer takes time, and if it's too slow, the decode GPU sits idle waiting.

How Large Is the KV Cache?

The KV cache stores key and value tensors for every layer and every token. Its size follows a simple formula:

KV cache = 2 × layers × KV heads × head_dim × sequence_length × bytes_per_value

The factor of 2 accounts for both K and V. For models using GQA (grouped-query attention), the number of KV heads is smaller than the number of attention heads — Llama 3.1 70B uses 8 KV heads instead of 64 attention heads, which shrinks the cache by 8x.

KV cache size (Llama 3.1 70B, 8192 tokens)

2 (K+V) × 80 (layers) × 8 (KV heads) × 128 (head_dim) × 8192 (tokens) × 2 (bytes/val) ≈ 2.7 GB

Transfer time (~2.7 GB KV cache)

PCIe 4.0: ~84 ms
InfiniBand NDR: ~54 ms
NVLink: ~3.0 ms

NVLink is ~28× faster than PCIe — same-node transfers are near-free.

Interconnect Matters

The transfer time depends on the interconnect between the prefill and decode GPUs:

Interconnect | Bandwidth | Use Case
PCIe 4.0 | 32 GB/s | Cross-node (common but slow)
InfiniBand NDR | 50 GB/s | Cross-node (data center standard)
NVLink | 900 GB/s | Intra-node (same server, very fast)

For Llama 3.1 70B at 4096 tokens (~1.3 GB of KV cache), transfer takes ~42ms over PCIe, ~27ms over InfiniBand, or ~1.5ms over NVLink. A typical decode step takes ~10ms — so PCIe and InfiniBand create idle time, while NVLink is fast enough to hide the transfer.
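
Both the cache sizes and the transfer times above fall straight out of the formula and the bandwidth table; a quick check in Python:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val  # K and V

BANDWIDTH_GB_S = {"PCIe 4.0": 32, "InfiniBand NDR": 50, "NVLink": 900}

for seq_len in (8192, 4096):  # the two worked examples above
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9  # Llama 3.1 70B shapes
    for link, bw in BANDWIDTH_GB_S.items():
        print(f"{seq_len} tokens ({gb:.2f} GB) over {link}: {gb / bw * 1e3:.1f} ms")
# 8192 tokens (2.68 GB): ~84 / ~54 / ~3.0 ms
# 4096 tokens (1.34 GB): ~42 / ~27 / ~1.5 ms
```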

Reducing Transfer Cost

Chunked transfer: Start sending KV cache while prefill is still running. After the first chunk of tokens is prefilled, send those KV entries immediately. By the time prefill finishes, most of the cache is already at the decode GPU.
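
A sketch of that overlap, assuming a chunk-wise engine: prefill_chunk and send_kv are hypothetical stand-ins for the attention kernel and the interconnect send, and the background thread plays the role of the transfer stream.

```python
import threading
from queue import Queue

def prefill_with_streaming_kv(prompt_chunks, prefill_chunk, send_kv):
    outbox: Queue = Queue()

    def sender():
        # Ship each KV chunk as soon as it exists, overlapping the transfer
        # with the prefill compute still running on the main thread.
        while (kv := outbox.get()) is not None:
            send_kv(kv)

    t = threading.Thread(target=sender)
    t.start()
    for chunk in prompt_chunks:
        outbox.put(prefill_chunk(chunk))  # this chunk's KV starts moving now
    outbox.put(None)                      # sentinel: prefill is done
    t.join()                              # most of the cache is already across
```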

Placement-aware scheduling: DistServe places prefill and decode pools based on cluster bandwidth topology. If two GPUs share an NVLink connection, they're paired as prefill/decode. Cross-node pairs use InfiniBand.

KV cache compression: Quantize KV entries from FP16/BF16 to INT8 or even INT4 before transfer, then dequantize on the decode side. INT8 halves the transfer size (INT4 quarters it) at the cost of minor accuracy impact.
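
For illustration, a minimal per-tensor symmetric INT8 quantizer in NumPy. Real systems typically keep per-head or per-channel scales; the shapes here are toy values:

```python
import numpy as np

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(kv).max()) / 127.0               # one scale per tensor
    q = np.round(kv.astype(np.float32) / scale).clip(-127, 127)
    return q.astype(np.int8), scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float16) * np.float16(scale)       # decode-side restore

kv = np.random.randn(80, 8, 256, 128).astype(np.float16)  # (layers, heads, tokens, dim)
q, scale = quantize_kv(kv)
print(kv.nbytes / q.nbytes)                               # 2.0: half the payload
```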

On the right panel, drag the sequence length slider and switch between models. Watch how KV cache size and transfer time grow. Try different interconnects — notice how NVLink makes the bottleneck warning disappear.

When to Use What

Choosing the Right Approach

Not every deployment needs disaggregation. The right choice depends on your scale, hardware, and latency requirements.

Decision Framework

Approach | When to Use | Complexity
Unified | Low QPS, short prompts, single GPU | None
Chunked Prefill | Medium QPS, mixed prompt lengths, single node | Low (config change)
Full Disaggregation | High QPS, strict SLOs, multi-node cluster | High (infra change)
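
The same table as a rule of thumb in code. The thresholds are the rough figures from this section, not universal constants; real deployments should profile:

```python
def recommend(qps: float, ttft_slo_ms: float, multi_node: bool) -> str:
    if qps < 5:
        return "unified"                  # interference is negligible
    if qps >= 50 and ttft_slo_ms < 200 and multi_node:
        return "full disaggregation"      # also needs a fast interconnect
    return "chunked prefill"              # one config flag, single node

print(recommend(qps=2, ttft_slo_ms=1000, multi_node=False))  # unified
print(recommend(qps=80, ttft_slo_ms=150, multi_node=True))   # full disaggregation
```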

The Key Metrics

TTFT (Time to First Token): How long until the user sees the first output token. Chunked prefill improves this dramatically (86% in vLLM) because decode isn't blocked by long prefills. Disaggregation improves it further by eliminating all interference.

TPOT (Time Per Output Token): How fast tokens stream after the first one. Chunked prefill helps slightly (decode gets regular GPU time). Disaggregation helps more — the decode pool runs uninterrupted.

Throughput: Total tokens per second across all requests. Disaggregation wins at scale because each pool operates at peak efficiency for its workload type. Splitwise showed 2.35x throughput improvement at the same cost and power budget.
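
Both latency metrics follow directly from per-token arrival timestamps. A minimal sketch (timestamps in seconds, values illustrative):

```python
def ttft(request_start: float, token_times: list[float]) -> float:
    return token_times[0] - request_start        # wait until the first token

def tpot(token_times: list[float]) -> float:
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)                 # mean gap after the first token

times = [0.35, 0.38, 0.41, 0.44]
print(f"TTFT: {ttft(0.0, times) * 1e3:.0f} ms")  # 350 ms
print(f"TPOT: {tpot(times) * 1e3:.0f} ms")       # 30 ms
```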

When Unified Is Fine

If you're running a single GPU serving a chatbot at low QPS (< 5 requests/second) with prompts under 1K tokens, the interference is minimal. Adding chunked prefill won't noticeably improve user experience, and disaggregation would be over-engineering.

When Chunked Prefill Suffices

Medium traffic (5-50 QPS) with occasionally long prompts. Enable chunked prefill in vLLM with --enable-chunked-prefill — it's a configuration flag, not an infrastructure change. Most single-node deployments should start here.

When to Disaggregate

High traffic (50+ QPS) with strict latency SLOs (< 200ms TTFT) and a multi-node GPU cluster. You need fast interconnect (NVLink or InfiniBand) between prefill and decode pools. The infrastructure complexity is significant — separate scheduling, KV cache transfer, pool management — but the throughput and latency gains are substantial.

On the right panel, adjust QPS, prompt length, and SLO strictness. Watch how the recommended approach changes. At low QPS, unified is fine. Crank up QPS with a strict SLO and disaggregation becomes the only option that meets the target.
