
Serving Metrics & SLOs — TTFT, TPOT, Goodput, P99

The Metrics That Matter

What Are Serving Metrics?

Serving metrics measure what users actually experience when interacting with an LLM. Three metrics capture the complete picture: how long you wait for the first word, how fast tokens stream after that, and how long the whole response takes. Understanding these metrics is the foundation for building systems that feel fast.

Time to First Token (TTFT)

TTFT is how long a user waits before seeing the first word of the response. It's the awkward silence before the model starts talking.

TTFT has two components:

TTFT = queue wait + prefill time

At low load, prefill dominates — the model processes your prompt tokens. At high load, queuing dominates — your request waits behind other requests' prefills. This distinction matters: if TTFT is slow because of queuing, you need more capacity. If it's slow because of prefill, you need faster compute or a shorter prompt.

Example request timeline: queue 85 ms + prefill 62 ms → TTFT = 147 ms; decode 340 ms over 8 output tokens → TPOT ≈ 49 ms/token; E2E = 487 ms.

Time Per Output Token (TPOT)

TPOT is how fast tokens stream after the first one — the reading speed of the response. The formula:

TPOT = (E2E latency − TTFT) / (output tokens − 1)

The −1 matters: it counts the gaps between tokens, not the tokens themselves. 8 output tokens create 7 gaps. Including the first token in the denominator would contaminate a decode metric with prefill latency.
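A quick worked calculation with the timeline numbers above makes the formula concrete (a minimal sketch; the variable names are just illustrative):

```python
# Numbers from the example timeline above.
queue_ms, prefill_ms, decode_ms = 85, 62, 340
output_tokens = 8

ttft_ms = queue_ms + prefill_ms                      # 147 ms: queue wait + prefill
e2e_ms = ttft_ms + decode_ms                         # 487 ms end-to-end
tpot_ms = (e2e_ms - ttft_ms) / (output_tokens - 1)   # 340 / 7 ≈ 48.6 ms per gap

print(f"TTFT={ttft_ms} ms  TPOT={tpot_ms:.1f} ms/token  E2E={e2e_ms} ms")
```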

MLPerf's 40ms TPOT target (25 tokens/second) has real empirical backing: MLCommons validated it against actual ChatGPT and Perplexity usage data as the point where streaming feels "seamless" to users.

ITL vs TPOT: A Subtle Distinction

For a single request, mean Inter-Token Latency (ITL) equals TPOT. They're identical.

Across many requests, they diverge:

  • TPOT is request-weighted — each request contributes equally regardless of output length
  • ITL is token-weighted — longer responses dominate the average

This distinction causes real benchmark confusion. NVIDIA's GenAI-Perf excludes TTFT from ITL; the LLMPerf tool includes it. The same system can report different numbers depending on which tool you use.
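A small sketch of the two aggregations shows how they diverge (the per-request numbers are made up for illustration):

```python
# Two completed requests: (decode time in ms, output tokens).
requests = [(140, 8), (4900, 71)]   # one short reply, one long one

# TPOT: average of per-request means -- each request counts once.
per_request = [decode / (tokens - 1) for decode, tokens in requests]
tpot = sum(per_request) / len(per_request)            # (20 + 70) / 2 = 45.0 ms

# ITL: pool every inter-token gap -- long responses dominate.
itl = sum(d for d, _ in requests) / sum(t - 1 for _, t in requests)  # 5040 / 77 ≈ 65.5 ms

print(f"request-weighted TPOT = {tpot:.1f} ms, token-weighted ITL = {itl:.1f} ms")
```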

Click different requests in the right panel to see how prompt length affects TTFT vs TPOT. Short prompts have small prefill time (low TTFT) but the same TPOT. Long prompts inflate TTFT while TPOT stays similar.

Percentiles & Tail Latency

Why Averages Lie

"Average TTFT: 150ms" sounds great. But what if 1% of users wait 800ms? Averages collapse the distribution into a single number, hiding the tail where real frustration lives.

Percentiles reveal the full story.

The Percentile Ladder

P50 (median) — the typical user experience. Half of requests are faster, half slower. If your P50 is good, most users are happy.

P95 — "almost everyone." 95 out of 100 requests are this fast or faster. A common SLO target for non-critical services.

P99 — "the tail." 1 in 100 requests experiences this latency or worse. The standard SLO target for production services. MLPerf uses P99 for all its benchmarks.

P999 — the far tail. Matters at scale: 10 million requests per day means 10,000 users hit P999.
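Computing the ladder from raw samples takes only a few lines (the latencies below are synthetic):

```python
import numpy as np

# 100,000 synthetic TTFT samples: mostly fast, with a long lognormal tail.
rng = np.random.default_rng(0)
ttft_ms = rng.lognormal(mean=5.0, sigma=0.5, size=100_000)

for p in (50, 95, 99, 99.9):
    print(f"P{p}: {np.percentile(ttft_ms, p):.0f} ms")
```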

The Hockey Stick

Latency distributions in LLM serving follow a characteristic shape: P50 through P90 are relatively flat, then P99 spikes sharply. This is the "hockey stick" — the blade is the tail.

The spike happens because tail requests hit worst-case conditions: they arrive during a long prefill, their KV cache gets preempted, or they land in a batch with a very long prompt. These events are rare but devastating.

The P99/P50 Gap

The ratio between P99 and P50 reveals system stability:

  • Gap < 2× — stable system. Tight latency distribution. Good scheduling.
  • Gap 2–5× — moderate interference. Some requests hit queuing or prefill contention.
  • Gap > 5× — severe interference. The tail is unpredictable. Likely needs architectural changes (chunked prefill, disaggregation).

At 10,000 requests per day, a 5× P99/P50 gap means 100 users experience latency 5× worse than the median. Those users don't care about your average.
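The stability check is easy to automate; a minimal sketch using the thresholds from the list above:

```python
import numpy as np

def p99_p50_gap(latencies_ms):
    """Classify scheduling health from the P99/P50 ratio."""
    p50, p99 = np.percentile(latencies_ms, [50, 99])
    gap = p99 / p50
    if gap < 2:
        return gap, "stable: tight latency distribution"
    if gap <= 5:
        return gap, "moderate interference: queuing or prefill contention"
    return gap, "severe interference: consider chunked prefill or disaggregation"
```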

In the right panel, click "+ 10 Requests" to build the histogram. Then click "+ 1 Slow" — watch P99 jump while P50 barely moves. Toggle between TTFT and TPOT distributions to see which has worse tail behavior.

Throughput vs Goodput

Throughput Lies. Goodput Tells the Truth.

Imagine a restaurant that serves 100 meals per hour. Sounds healthy — until you find out 70 of those meals arrived cold, late, or wrong. The kitchen was busy. The customers were miserable.

  • Throughput counts everything produced: 100 meals/hour
  • Goodput counts only the ones customers were happy with: 30 meals/hour

This is the same gap we're about to see in LLM serving.

Throughput: 10 req/s — all ten requests completed.
Goodput: 3 req/s — requests 4–10 violated the SLO.
Same system, same second — throughput looks healthy, goodput tells the truth.

A Concrete LLM Example

Your team sets an SLO (the bar you promised users):

  • First word appears within 200ms (TTFT < 200ms)
  • Tokens stream at least 20 per second (TPOT < 50ms)

At peak traffic, your server processes 10 requests per second. But when you look at what users actually experienced:

  • 3 requests felt snappy — both targets met ✓
  • 7 requests felt broken — slow first word, stuttering tokens ✗

Throughput = 10 req/s — the dashboard looks green. Goodput = 3 req/s — only 3 out of 10 users actually had a good experience.

Goodput is the honest metric. It asks: how many users per second got what they came for?
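Measured over a window, goodput is just throughput filtered through the SLO. A minimal sketch with hypothetical per-request measurements matching the example above:

```python
# Hypothetical (ttft_ms, tpot_ms) measurements for the 10 requests in one second.
completed = [
    (120, 42), (160, 38), (190, 45),                  # 3 snappy requests
    (420, 95), (510, 110), (380, 80), (600, 130),
    (700, 150), (450, 90), (820, 170),                # 7 that missed the bar
]
SLO_TTFT_MS, SLO_TPOT_MS = 200, 50
window_s = 1.0

throughput = len(completed) / window_s
goodput = sum(t < SLO_TTFT_MS and p < SLO_TPOT_MS for t, p in completed) / window_s

print(f"throughput = {throughput:.0f} req/s, goodput = {goodput:.0f} req/s")
# -> throughput = 10 req/s, goodput = 3 req/s
```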

The Formal Definition

From the DistServe paper (OSDI 2024):

Goodput is the maximum sustainable request rate where ≥90% of requests satisfy both TTFT and TPOT SLO targets.

In plain English: at least 9 out of every 10 requests must meet both speed bars. If only 8 out of 10 meet the bar, your system has fallen below goodput — even if throughput is climbing.

Why This Matters Specifically for LLM Serving

LLM serving has four characteristics that make the throughput-goodput gap unusually large and dangerous.

1. You can fake throughput cheaply. Just make batches bigger. Throughput goes up, but every user in that bigger batch gets slower token streaming — 30 tok/s drops to 8 tok/s. The dashboard looks healthier while users churn. Goodput catches this; throughput doesn't.

2. The tail is huge. LLM request latency varies 10–20× between the median user and the P99 user. Throughput averages this out and hides the pain. One unlucky request waits 2000ms for its first token while your "average TTFT" reads 150ms.

3. Capacity planning fails without it. If you think "we can handle 20 req/s" based on throughput, but goodput at 20 req/s is only 5 req/s, you've lied to yourself about capacity. You need 4× more GPUs than you thought.

4. It aligns with business reality. Users don't care if your system is busy. They care whether their response was fast. A chat app running at 80% goodput while serving 1M requests/day is delivering 200,000 bad experiences per day. Throughput won't show that. Goodput will.

The short version: throughput measures work done. Goodput measures users served well. Those two numbers often diverge dramatically in LLM serving — and only goodput predicts retention.

The Binary Goodput Critique

There's a subtle flaw in the standard definition. It's binary: a request either meets SLO or it doesn't. A response that arrives 1ms late counts the same as one that arrives 10 seconds late — both score zero.

A 2024 paper ("Revisiting SLOs") points out this creates a perverse incentive: a system can improve its goodput by giving up on slow requests entirely. If a request is definitely going to miss SLO, dropping it costs you nothing (zero either way), and it frees up GPU for faster requests.

The paper proposes "smooth goodput" — partial credit based on how late a response arrives. 1ms late? Almost full credit. 10 seconds late? Near zero. This is active research; for now, most production systems use the binary definition.
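The partial-credit idea can be sketched with a simple decaying score; the exponential decay below is an illustrative choice, not the scoring function defined in the paper:

```python
import math

def smooth_credit(latency_ms, slo_ms, half_life_ms=100):
    """Full credit within SLO; credit halves every `half_life_ms` past the deadline.
    (Illustrative scoring function, not the one from the paper.)"""
    if latency_ms <= slo_ms:
        return 1.0
    return math.exp(-(latency_ms - slo_ms) * math.log(2) / half_life_ms)

print(smooth_credit(201, 200))     # 1 ms late  -> ~0.99 credit
print(smooth_credit(10_200, 200))  # 10 s late  -> ~0.0 credit
```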

Drag the QPS slider rightward in the simulation. At low load, throughput and goodput match — every dot is green. As load climbs, dots turn red and the goodput counter drops below throughput. That growing gap is your user experience debt.

SLO-Driven System Design

SLOs Aren't Just Numbers

An SLO (Service Level Objective) is a promise: "99% of requests will have TTFT under 200ms." But SLO targets don't just measure your system — they constrain which architectures are viable.

The numeric target by itself hides the shape that matters. Select an SLO target below to see how the violation cohort changes against the same distribution — the long tail past p99 IS what the SLO defends.

TTFT distribution (synthetic): p50 257 ms, p90 670 ms, p99 1179 ms. The same shape under three different SLO targets — watch the red violation cohort grow as you tighten the line (at SLO = 1 s, violations are 2.6%).

Workload Profiles

Different applications need different SLOs. The DistServe paper categorizes four profiles:

| Workload | TTFT Need | TPOT Need | Example |
| --- | --- | --- | --- |
| Chatbot | Tight | Medium | ChatGPT, Claude |
| Code completion | Tight | Tight | Copilot, Cursor |
| Summarization | Loose | Medium | Document processing |
| Reasoning | Very loose | Loose | DeepSeek-R1 |

Real-World SLO Targets

MLPerf Inference publishes audited P99 SLO targets — the most authoritative benchmarks available:

| Model | P99 TTFT | P99 TPOT | Context |
| --- | --- | --- | --- |
| Llama 2 70B | < 450 ms | < 40 ms | Interactive chat |
| DeepSeek-R1 | < 2 s | < 80 ms | Reasoning (thinking time expected) |
| Llama 3.1 405B | < 6 s | < 175 ms | Long-context (large prompt expected) |

The 40ms TPOT target (25 tok/s) has the strongest empirical backing — MLCommons validated it against real ChatGPT and Perplexity usage data.
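Encoded as a config, the targets from the table above become a simple gate; the dict layout and helper function here are just an illustration:

```python
# P99 SLO targets from the MLPerf table above (milliseconds).
MLPERF_P99_SLOS = {
    "llama2-70b":    {"ttft": 450,  "tpot": 40},
    "deepseek-r1":   {"ttft": 2000, "tpot": 80},
    "llama3.1-405b": {"ttft": 6000, "tpot": 175},
}

def meets_slo(model, ttft_p99_ms, tpot_p99_ms):
    slo = MLPERF_P99_SLOS[model]
    return ttft_p99_ms <= slo["ttft"] and tpot_p99_ms <= slo["tpot"]

print(meets_slo("llama2-70b", 430, 38))   # True: both tails inside the target
print(meets_slo("llama2-70b", 430, 55))   # False: TPOT tail too slow
```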

How SLOs Force Architecture

Strict TTFT pushes toward disaggregation. On a unified GPU, your request queues behind other requests' prefills. Dedicated prefill GPUs eliminate this queuing. Chunked prefill helps but doesn't eliminate it entirely.

Strict TPOT also pushes toward disaggregation. On a unified GPU, a new prefill arriving mid-decode causes all decode requests to stall. Dedicated decode GPUs are free from prefill interference.

Code completion needs both strict — making disaggregation almost required at any meaningful QPS. Unified GPUs can't meet both TTFT < 100ms and TPOT < 30ms simultaneously when load is non-trivial.

Throughput priority (summarization, batch processing) favors unified GPUs with continuous batching and large batch sizes. TTFT can be relaxed, so the queuing penalty doesn't matter.

Set different configs on each side of the comparison panel. Try the "Chatbot" preset — which config meets both SLOs? Now switch to "Summarization" — does the winner change? The architecture that's best depends entirely on which metrics your SLO prioritizes.

Capacity Planning & Saturation

The Knee of the Curve

As you increase QPS (queries per second), latency doesn't grow linearly. It stays flat... flat... flat... then explodes.

This is the saturation curve, and finding its knee is the most important capacity planning exercise you'll do.

L = λ × W — with λ = 50 req/s and W = 0.10 s, L = 5 requests in flight.

Little's Law: L = λ × W is exact for any stable queue. Service time is fixed at 50 ms. As λ approaches capacity (~100 req/s), W diverges — that's the knee in the saturation chart on the right.
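You can reproduce the divergence with a textbook M/M/1 approximation — an assumption, since a batched GPU server isn't literally an M/M/1 queue, but the shape of the knee is the same:

```python
MU = 100.0   # capacity in req/s (the ~100 req/s knee mentioned above)

def mm1_latency_s(lam):
    """M/M/1 mean time in system: W = 1 / (mu - lambda); diverges as lambda -> mu."""
    if lam >= MU:
        return float("inf")          # unstable: the queue grows without bound
    return 1.0 / (MU - lam)

for lam in (50, 70, 85, 95, 99):
    w = mm1_latency_s(lam)
    print(f"lambda={lam:3d} req/s  W={w * 1000:6.1f} ms  L={lam * w:5.1f} in flight")
```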

Why It's Nonlinear

At low load, each request gets immediate GPU time. Queue wait is near zero. TTFT ≈ prefill time.

At moderate load, requests occasionally queue behind each other. P50 is fine, but P99 starts climbing — some unlucky requests hit a burst of arrivals.

At high load (approaching capacity), every request queues. The system processes them as fast as it can, but new arrivals pile up faster than old ones finish. This is queuing theory in action: latency diverges toward infinity as utilization approaches 100%.

NVIDIA's benchmarking team observed: "the difference between 40 and 45 concurrent users may be the difference between 400ms TTFT and 4000ms TTFT." A 12% load increase caused a 10× latency increase.

The Goodput Cliff

The saturation curve has a companion: the goodput cliff.

At 70% capacity: latency is well within SLO. Goodput ≈ throughput. Users are happy.

At 85% capacity: P99 starts climbing. A few requests miss SLO. Goodput is slightly below throughput.

At 95% capacity: P99 explodes. Many requests miss SLO. Goodput falls sharply even though throughput is flat.

At 100%+: goodput collapses. The system is processing requests (throughput is at max), but almost none meet SLO (goodput approaches zero). The system is busy but not useful.

Practical Lessons

Plan for burst, not average. If average traffic is 50 req/s but peak bursts hit 80, you need capacity for 80. The knee doesn't care about your average — it only cares about the peak.

Target 70–80% of saturation. This gives you headroom for traffic spikes without crossing the knee. If your saturation point is 40 req/s, plan for a max of 28–32 req/s.

Monitor goodput, not throughput. Throughput stays flat near saturation — it's a lagging indicator. Goodput drops first. If you set alerts on goodput decline, you'll catch saturation before it hits users.

Different configs have different knees. Disaggregation pushes the saturation point further right — it handles more QPS before the knee. But it costs more hardware. The right tradeoff depends on your SLO and traffic pattern.

Drag the QPS slider slowly rightward. Watch the latency curve build — find the exact QPS where the curve goes steep. That's the knee. Now check the goodput percentage: it starts at 100%, stays flat, then cliff-dives. Try switching to "Disaggregated" — the knee shifts right, buying you more headroom.
