Multi-LoRA Serving
The Cost Crisis
What is multi-LoRA serving?
Multi-LoRA serving means running many LoRA adapters — small fine-tuning deltas — on top of a single shared base model, swapping the active adapter per request. One 14 GB base model can host hundreds of fine-tunes at roughly the cost of one, because each adapter is only tens of megabytes. The rest of this module covers the engineering the serving stack has to do to make those economics actually work.
A LoRA adapter (Low-Rank Adaptation, Hu et al. 2021) is the small weight delta produced by fine-tuning with LoRA instead of full fine-tuning. Think of it as a tiny "personality patch" you clip onto the base model at inference time. A full 7B fine-tune is 14 GB. A LoRA adapter that gives you the same behavior change is typically around 50 MB — roughly 280x smaller.
280× smaller · same behavior change
The tenant cost math
Imagine a SaaS product with N customer tenants, each wanting their own fine-tuned 7B model.
- Full fine-tunes: N separate 14 GB models. On an 80 GB A100 you can fit about 5 before you need a second GPU. 100 tenants is 20 GPUs of base-model weight. 1,000 tenants is 200 GPUs. The cost grows linearly and it grows fast.
- Base + adapters: one 14 GB base once, plus N tiny adapters at ~50 MB each. 100 tenants adds 5 GB of adapters — still one GPU. 1,000 tenants adds 50 GB of adapters — still well within a single 80 GB GPU once you factor in tiered caching (step 4).
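A quick back-of-the-envelope script makes the divergence concrete. This is a sketch of the arithmetic above, not a capacity planner: it uses the same 14 GB / ~50 MB / 80 GB assumptions as this module and counts weights only (no KV cache, no activations).

```python
# Back-of-the-envelope GPU count: full fine-tunes vs. base + adapters.
import math

BASE_GB = 14.0       # one 7B model in FP16
ADAPTER_GB = 0.05    # ~50 MB LoRA adapter
GPU_GB = 80.0        # A100 80 GB

def gpus_full_finetunes(tenants: int) -> int:
    models_per_gpu = int(GPU_GB // BASE_GB)        # ~5 full models per GPU
    return math.ceil(tenants / models_per_gpu)

def gpus_base_plus_adapters(tenants: int) -> int:
    total_gb = BASE_GB + tenants * ADAPTER_GB      # one base + N tiny adapters
    return math.ceil(total_gb / GPU_GB)

for n in (1, 10, 100, 1_000):
    print(f"{n:>5} tenants: {gpus_full_finetunes(n):>4} GPUs (full fine-tunes) "
          f"vs {gpus_base_plus_adapters(n)} GPU(s) (base + adapters)")
```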
This is the core pitch, famously summarized as "hundreds of fine-tuned LLMs for the cost of one" (Predibase/LoRAX). Click the tenant-scale presets on the right (1 / 10 / 100 / 1,000) and watch the two bars diverge. The top bar grows linearly; the bottom bar is almost flat.
The merge trick is wrong for serving
If you've seen LoRA before, you may have seen the "merge" trick: compute W' = W + BA once, replace W with W', and serve the merged model as if it were a regular fine-tune. That works fine if you only ever serve one adapter. It is fatal for multi-tenant serving.
The moment you mutate W into W', the base weights are gone — you can't serve tenant B's adapter off the same GPU without re-loading 14 GB of base. Multi-tenancy requires keeping W pristine and computing the adapter contribution separately at every forward pass:
y = Wx + (BA)x
Every production multi-LoRA system (S-LoRA, Punica, LoRAX, vLLM's LoRA path, TensorRT-LLM's LoRA cache) keeps Wx and (BA)x as two distinct computations and adds the results. The whole rest of this module — SGMV kernels, unified paging, rank tradeoffs — exists because we refuse to merge.
The merged form y = (W + BA)x is what trainers write. The un-merged form y = Wx + (BA)x is what servers run. Same math, but the un-merged form is the only way one GPU can serve many tenants.
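Here is the difference in code. A minimal PyTorch sketch (the tensor names and tenant dictionary are illustrative, not from any particular serving engine): the merged path mutates W and locks the copy to one tenant, while the un-merged path keeps W frozen and switches adapters per request.

```python
import torch

d, r = 4096, 16
W = torch.randn(d, d) * 0.02          # frozen base weight, shared by all tenants
adapters = {                           # per-tenant low-rank pairs (B, A)
    "tenant_a": (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
    "tenant_b": (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
}

def merged_forward(x, B, A):
    # Trainer's view: fold the delta into W once. Fine for a single adapter,
    # but this copy of W is now tenant-specific -- no other adapter can share it.
    W_prime = W + B @ A
    return x @ W_prime.T

def unmerged_forward(x, tenant):
    # Server's view: W stays pristine; the adapter contribution is a separate
    # pair of cheap matmuls added on top. Every tenant rides the same W.
    B, A = adapters[tenant]
    return x @ W.T + (x @ A.T) @ B.T   # Wx + (BA)x, factored through rank r

x = torch.randn(8, d)                  # a batch of 8 token vectors
y_a = unmerged_forward(x, "tenant_a")
y_b = unmerged_forward(x, "tenant_b")  # same GPU, same W, different adapter
```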
Next: we look at what B and A actually are, how big they get, and why the rank r is the knob that controls everything downstream.
LoRA Math at Serving Time
How does LoRA work at inference time?
Every adapted linear layer computes:
y = Wx + (BA)x
- W is the frozen base weight, shape d_out × d_in. For a Llama-7B attention projection, d_in = d_out = 4096.
- B is trainable, shape d_out × r.
- A is trainable, shape r × d_in.
- r is the rank — a small integer, typically 8 to 64.
The insight from the original LoRA paper is empirical: the delta produced by fine-tuning is approximately low-rank. You don't need a full d_out × d_in update matrix — a rank-r factorization BA recovers most of the quality at a fraction of the parameter count.
Why it's tiny. W has d × d = 16.7M params per projection. B and A together have 2 × r × d params. At r = 16, that's 131K params — about 0.8% of W. Across a 7B model, well under 1% of total parameters are trainable.
Drag the r slider on the right (4 to 128) and watch the B and A strips grow along their narrow axis. The big W box never moves — that's the whole point.
Adapter size formula
For an FP16 adapter over m target modules per layer (all with dimension d), across L transformer layers:
adapter_size = 2 × r × d × m × L × 2 bytes
= 4 × r × d × m × L bytes
The 2 × r × d counts params in B + A per module; × m × L counts every adapted module in the network; the trailing × 2 is FP16 (2 bytes per param).
Worked example. Llama-7B, d = 4096, r = 16, 4 target modules per layer (q_proj, k_proj, v_proj, o_proj), 32 layers:
4 × 16 × 4096 = 262,144 bytes per module
× 4 modules × 32 layers = 33,554,432 bytes
≈ 32 MB just for attention projections
With LoRA scaling factors and overhead, real HuggingFace uploads for "all-attention at r=16 on 7B" land around 45–50 MB. Real filenames you can go download:
- tloen/alpaca-lora-7b/adapter_model.safetensors — ~17 MB (r=8, q + v only)
- timdettmers/qlora-alpaca-7b/adapter_model.bin — ~83 MB (r=64, all-attention, the QLoRA paper weights)
- huggyllama/llama-7b-lora/adapter_model.safetensors — ~97 MB (r=64, all-attention)
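The size formula is easy to script. A small sketch (the helper name and module shapes are illustrative; real uploads also carry config metadata and sometimes extra saved modules, so treat the output as a lower bound):

```python
def lora_adapter_bytes(rank: int,
                       dims: list[tuple[int, int]],
                       n_layers: int,
                       bytes_per_param: int = 2) -> int:
    """Size of a LoRA adapter: one (B, A) pair per target module per layer.

    dims            -- (d_out, d_in) of each adapted linear layer within one layer
    bytes_per_param -- 2 for FP16/BF16 adapters
    """
    params_per_layer = sum(rank * (d_out + d_in) for d_out, d_in in dims)
    return params_per_layer * n_layers * bytes_per_param

# Llama-7B shapes: attention projections are 4096x4096,
# MLP gate/up are 11008x4096 and down is 4096x11008.
attn = [(4096, 4096)] * 4                              # q, k, v, o
mlp = [(11008, 4096), (11008, 4096), (4096, 11008)]    # gate, up, down

for name, rank, dims in [("minimal (r=8, q+v only)", 8, attn[:2]),
                         ("typical (r=16, all-attention)", 16, attn),
                         ("heavy (r=64, all-linear)", 64, attn + mlp)]:
    mb = lora_adapter_bytes(rank, dims, n_layers=32) / 2**20
    print(f"{name:32s} ~{mb:.0f} MB")
```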
Toggle the target-module buttons on the right (q, v only / all-attention / all-linear) to see the count change. Each added module adds its own B/A pair.
Typical sizes across the fleet
| Config | Rank | Target modules | Adapter size (7B) |
|---|---|---|---|
| Minimal | r=8 | q_proj, v_proj | ~8 MB |
| Typical | r=16 | all-attention (q, k, v, o) | ~50 MB |
| Heavy | r=64 | all-linear (+ MLP gate/up/down) | ~300 MB |
The choice of target modules — which linear layers you actually adapt — is the second knob alongside rank. q, v only is the original LoRA paper recommendation and the cheapest setting. all-attention is the modern default and the sweet spot for most tasks. all-linear (attention + MLP) gives the highest quality but 4–6x larger adapters. The right-panel toggle lets you see the size jump for each choice.
Why this matters for serving
Every number in this module flows from the adapter-size formula. Step 3 batches adapters across requests, so "adapter size" sets how much weight data a kernel pulls per segment. Step 4 caches adapters across tiers, so "adapter size" sets your working-set footprint. Step 5 trades adapter pool against KV cache, so "rank times count" is the memory the serving engine reserves up front.
Keep the slider in mind: at r = 16 you're spending ~50 MB per tenant. At r = 128 you're spending ~400 MB per tenant. That 8x is not free, and the next three steps show exactly where it shows up.
Heterogeneous Batches and SGMV
Quick recap — what's in an adapter
From step 2: every LoRA adapter is a pair of small matrices B (d_out × r) and A (r × d_in), applied as (BA)x and added to the frozen base result Wx. r is tiny (typically 16). That's the payload each request brings to the forward pass.
The heterogeneous batch problem
Recall from the Batching module: throughput on modern GPUs comes from batching — stacking many requests' tokens into one big matmul so the base weight read W gets amortized across all of them. 7B base = 14 GB weight read per decode step. 1 request alone = 14 GB for 1 result. 50 requests batched = same 14 GB, 50 results. That's the entire reason continuous batching exists.
Batching works beautifully when every request uses the same weights. Multi-LoRA breaks that assumption: request A uses adapter α, request B uses adapter β, request C uses adapter γ — each one has a different (BA) pair. The base matmul Wx still batches fine (same W for everyone), but the adapter contribution (BA)x seems to need a separate kernel per request.
Naive path: batch size 32 × 32 unique adapters = 32 separate kernel launches per layer, just for the LoRA contribution. On an 80-layer 70B model, that's 2,560 extra kernel launches per token — exactly the launch-overhead disaster the CUDA Graphs module just got you to care about.
What is SGMV (Segmented Gather Matrix-Vector)?
SGMV is the kernel that fixes this. Introduced in the Punica paper (Chen et al., 2023), it's a CUDA kernel that handles a mixed-adapter batch in a small number of launches — one per distinct adapter in the batch, not one per request.
The trick has two parts.
Part 1 — decompose the LoRA path as two matmuls. The raw math is y = x · A · B per request (row-vector convention, with A stored d_in × r and B stored r × d_out — the transpose of the step-2 layout). Written as one fused op, the shapes fight each other because each request's A and B are different. Punica's move: split it into two ops with an intermediate of shape r:
v = x · A # shape [batch, r] — apply A
Δy = v · B # shape [batch, d] — apply B
y = Wx + Δy # add to the base result
The intermediate v has batch × r floats — cheap to materialize. Now each matmul only has one adapter-dependent matrix on one side, which is exactly the shape a grouped GEMM can handle.
Part 2 — sort by adapter, run one grouped GEMM per segment. Reorder the batch so all requests using the same adapter sit next to each other. You get a handful of contiguous segments, each one uniform. Fire one kernel per segment that processes every row in that segment against that segment's A (then again for B). 32 requests using 4 unique adapters collapse from 32 launches to 4 launches per projection — one per segment.
This is the "Segmented" part of SGMV. "Gather" refers to the memory access pattern inside the kernel: each thread-block gathers its segment's rows from the flattened batch. "Matrix-Vector" because during decode each request contributes a single token — a vector — per segment.
Pick a preset on the right (all same / 2 popular + 1 rare / all different / realistic mix) or click any request cell to cycle its adapter color. Watch the top row (naive: one lightning bolt per cell) vs the bottom row (SGMV: one lightning bolt per segment). The counter at the bottom says it cleanly: naive N launches → SGMV K launches.
MBGMM vs MBGMV — prefill and decode need different kernels
Recall from the Prefill/Decode Disaggregation module: prefill processes many tokens at once (the whole prompt), while decode processes one token at a time. Same model, same adapters, but the math shape is different.
S-LoRA (the follow-up to Punica) ships two SGMV variants, one per regime:
- MBGMM — Multi-size Batched Gather Matrix-Matrix multiplication. The per-request input is prompt_len × d, a matrix. The kernel does batched matrix × matrix per segment. Used for prefill.
- MBGMV — Multi-size Batched Gather Matrix-Vector multiplication. The per-request input is 1 × d, a vector. The kernel does batched matrix × vector per segment. Used for decode.
The names are unglamorous but they carry the whole lesson: heterogeneous batching needs a kernel that takes the whole-segment shape into account, and prefill (compute-bound, large tiles) and decode (memory-bound, tiny tiles) want different tile sizes. vLLM, S-LoRA, and LoRAX all ship both.
The numbers
Punica (the original SGMV paper): up to 12x throughput vs naive multi-LoRA serving, with +2 ms/token overhead vs base-only (no LoRA at all). The overhead is the cost of that extra matmul pair.
S-LoRA (Sheng et al., 2023): extends Punica to 2,000 concurrent adapters on one GPU with only a ~5% throughput drop vs serving 5 adapters. Benchmarked at 4x faster than vLLM (pre-LoRA integration) and 30x faster than HuggingFace PEFT on multi-adapter workloads.
The headline: the marginal cost of the 2,000th adapter is almost zero, as long as SGMV is doing the batching.
If you only remember one thing from this step: sort by adapter, then run one grouped GEMM per segment. Every multi-LoRA kernel (Punica's SGMV, S-LoRA's MBGMM/MBGMV, vLLM's bgmv_*, LoRAX's Punica port) is a variation on that idea.
Next: where do those adapters physically live, and what happens when the working set exceeds GPU memory?
Unified Paging: Where Does Your Adapter Live?
How does multi-LoRA memory management work?
An adapter fleet can be huge — thousands of tenants, each with their own ~50 MB adapter. 2,000 adapters at 50 MB each is 100 GB of weight data, far more than a single GPU can hold. But at any instant, only a handful are actively serving requests. The rest are cold. So the serving system uses a 3-tier cache to keep hot adapters close and cold adapters cheap.
| Tier | Where it lives | Fetch latency | Typical capacity |
|---|---|---|---|
| GPU HBM | On the GPU, ready to serve | ~1 ms (already there) | tens of adapters |
| CPU RAM | Host memory, one PCIe hop away | ~10 ms promote to GPU | hundreds to thousands |
| Disk / object store | NVMe SSD or S3 | 100–200 ms cold load | unbounded |
A request arriving for adapter X does a cache lookup: if X is on GPU, serve immediately (~1 ms). If X is on CPU, promote it to GPU over PCIe (~10 ms), possibly evicting a cold adapter to make room. If X is on disk, load it into CPU first (~100–200 ms), then promote to GPU. That disk fetch is where cold-adapter TTFT spikes come from — more on that in step 6.
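A sketch of that lookup path (illustrative Python with made-up latency constants matching the table above; production loaders do the promotion on a background thread so a cold fetch doesn't block other requests):

```python
from collections import OrderedDict

GPU_SLOTS = 8            # how many adapters fit in the GPU pool
FETCH_MS = {"gpu": 1, "cpu": 10, "disk": 150}   # rough tier latencies from the table

gpu = OrderedDict()      # adapter_id -> weights, LRU order (most recent last)
cpu = OrderedDict()      # second tier: host RAM

def load_from_disk(adapter_id):
    return f"<weights of {adapter_id}>"          # stand-in for a safetensors load

def get_adapter(adapter_id):
    """Return (weights, latency_ms), promoting across tiers as needed."""
    if adapter_id in gpu:                        # hot: already resident
        gpu.move_to_end(adapter_id)
        return gpu[adapter_id], FETCH_MS["gpu"]

    if adapter_id in cpu:                        # warm: one PCIe hop away
        weights, latency = cpu.pop(adapter_id), FETCH_MS["cpu"]
    else:                                        # cold: disk -> CPU -> GPU
        weights, latency = load_from_disk(adapter_id), FETCH_MS["disk"]

    if len(gpu) >= GPU_SLOTS:                    # evict the coldest GPU adapter to CPU
        evicted_id, evicted_weights = gpu.popitem(last=False)
        cpu[evicted_id] = evicted_weights
    gpu[adapter_id] = weights
    return weights, latency
```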
Hit ▶ Play on the right. Try the three traffic pattern toggles:
- hot-3 (Zipfian — 80% of traffic on 3 adapters): GPU hit rate stays high, avg latency stays near 1 ms.
- uniform (equal traffic across 10 adapters): GPU hits depend on slot count; CPU hits fill the gap.
- churning (expanding working set): hits migrate to disk as the working set outgrows GPU and CPU. Watch the avg-latency number climb.
Adjust GPU slots with the + / − buttons to see capacity pressure directly.
PagedAttention, applied to weights
Recall from the KV Cache and Inference Engine modules: PagedAttention treats the KV cache as a pool of fixed-size pages with an allocation table. Instead of reserving one giant contiguous buffer per sequence (most of which sits unused), the engine carves the KV cache into small pages (e.g. 16 tokens each). Each sequence's block_table records which physical pages hold its KV entries, and the attention kernel follows the table to gather the right data. When a sequence ends, its pages go back to the free pool. When a new sequence arrives, pages are allocated on demand. If the pool is full, LRU eviction frees the coldest pages first. The whole point is turning a rigid per-sequence reservation into a flexible shared pool.
Multi-LoRA applies the same idea to adapter weights. The GPU adapter tier is a pool of fixed-size slots (one slot per adapter-worth of weights). A lookup table maps adapter_id → slot_index. When a new adapter is requested and the pool is full, the coldest adapter is evicted back to CPU RAM. Each adapter takes up one slot; same allocation, eviction, and LRU dance, just with bigger pages.
This parallel is why the framing is called unified paging: the same pool-and-eviction machinery can (and should) manage both KV pages and adapter slots out of one memory budget — which brings us to the production gotcha.
LoRAX and TensorRT-LLM
LoRAX (from Predibase) names the 3-tier model explicitly in its docs: adapters live on GPU, CPU, or "dynamic load from disk/Hub." Its eviction policy is LRU, and its loader is threaded so a cold fetch doesn't block the request path.
TensorRT-LLM ships a 2-level LoRA cache — GPU + CPU, no separate disk tier (you pre-register all adapters at startup). It's fast where it runs: a hot adapter swap between GPU slots is measured at 1–2 ms, including the PCIe copy. The tradeoff is that the fleet has to be known up front.
The unified-paging gotcha
Multi-LoRA has a subtle memory bug that doesn't exist in single-adapter serving. Here's the setup:
- An in-flight sequence has KV cache entries pinned to the GPU.
- That sequence was using adapter X.
- Adapter X gets evicted (say, because a burst of traffic for adapters Y and Z flooded the pool).
- The sequence is still running. Its KV entries are still there. But the adapter those entries were computed against is now... somewhere else.
If the serving engine statically partitions GPU memory into a fixed KV region and a fixed adapter region, those two regions evict independently. Adapter X leaves; the KV entries computed against it stay. They still occupy memory but can't safely be used by any scheduled computation. ExpertWeave (Aug 2025) measured this directly: under a static KV/adapter partition in vLLM-style serving, up to 46.5% of KV cache entries end up invalid — pinned memory with no live adapter to pair them with.
Unified paging means one pool, managed together. KV pages and adapter slots coexist in the same memory budget, and the eviction policy knows that evicting adapter X invalidates the KV pages of sequences using X — so it either evicts those pages too, or refuses to evict X. A static 70% KV / 30% adapter split cannot do this. ExpertWeave's fix, and the direction the field is moving, is to merge the two pools under one policy.
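One way to express the "refuses to evict X" half of that policy is reference counting: the pool never evicts an adapter that a live sequence's KV pages still depend on. A minimal sketch of that constraint (illustrative class, not any engine's actual code):

```python
from collections import OrderedDict

class UnifiedAdapterPool:
    """Adapter slots share one budget with KV pages; eviction respects live sequences."""

    def __init__(self, slots: int):
        self.slots = slots
        self.resident = OrderedDict()   # adapter_id -> weights, LRU order
        self.refcount = {}              # adapter_id -> number of in-flight sequences

    def acquire(self, adapter_id, weights_loader):
        """Called when a sequence starts; pins the adapter for the sequence's lifetime."""
        if adapter_id not in self.resident:
            self._make_room()
            self.resident[adapter_id] = weights_loader(adapter_id)
        self.resident.move_to_end(adapter_id)
        self.refcount[adapter_id] = self.refcount.get(adapter_id, 0) + 1
        return self.resident[adapter_id]

    def release(self, adapter_id):
        """Called when a sequence finishes; its KV pages are freed, so the pin drops."""
        self.refcount[adapter_id] -= 1

    def _make_room(self):
        if len(self.resident) < self.slots:
            return
        # Evict the coldest adapter with no live sequences. Evicting a pinned
        # adapter would strand that sequence's KV pages -- the invalid-KV bug above.
        for adapter_id in list(self.resident):
            if self.refcount.get(adapter_id, 0) == 0:
                del self.resident[adapter_id]
                return
        raise RuntimeError("all resident adapters are pinned by in-flight sequences")
```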
This is the final layer of the paging argument: it's not enough to page adapters like PagedAttention pages KV — you have to page them in the same pool so the eviction policy can keep the pairing consistent.
Next: rank and KV are fighting for the same memory. How much does each cost, and what knobs does vLLM give you?
Rank Tradeoff: Adapters Steal from KV
What does rank cost you?
Every byte of GPU memory that holds a LoRA adapter is a byte that can't hold a KV cache entry. Multi-LoRA doesn't create memory — it redistributes it. Understanding that tradeoff in concrete terms is how you size a production deployment without surprises.
Here's the mental model in one sentence: every GB you spend on adapters is ~2 fewer concurrent users at 2k context on a 7B model.
That "~2 users per GB" is a 7B-at-2k-context rule of thumb from the KV Cache module: one sequence's KV entries run about 0.5 GB at 2k tokens, so each GB of memory is roughly two concurrent sequences. Spend a GB on adapters and you just kicked two users out of the batch. Spend 10 GB on adapters and you lose ~20 users.
Pick a rank pill on the right (r = 8, 16, 32, 64, 128) then click + add adapter to grow the fleet. The pink bar (adapter pool) grows; the indigo bar (KV cache) shrinks 1:1. The big number below — max concurrent sequences — is what that tradeoff costs you.
Worked example
Let's ground this in real numbers. 7B model on an A100 80 GB:
- Base model: 14 GB (FP16 weights).
- Runtime overhead (activations, workspace, CUDA context): ~2 GB.
- Pool available for KV + adapters: ~64 GB.
Now load 20 adapters at r = 16 all-attention:
20 adapters × 50 MB = 1,000 MB ≈ 1 GB for adapters
64 GB pool − 1 GB = 63 GB left for KV
63 GB / 0.5 GB per seq ≈ 126 concurrent sequences
Bump those same 20 adapters up to r = 128 all-attention:
20 adapters × 400 MB = 8 GB for adapters
64 − 8 = 56 GB left for KV
56 / 0.5 ≈ 112 concurrent sequences
Dropping from 126 to 112 concurrent sequences is an 11% throughput cut for the rank bump, on a 7B. On a bigger model, or with more adapters, or both, the cut gets steeper fast. The simulator on the right reproduces this math in real time — the big counter should match what you'd work out by hand.
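The same arithmetic as a small script, using this module's rule-of-thumb constants (14 GB base, ~2 GB runtime overhead, 0.5 GB of KV per 2k-context sequence, ~50 MB per r=16 adapter); swap in your own model's numbers for real capacity planning:

```python
def max_concurrent_seqs(gpu_gb: float,
                        base_gb: float,
                        n_adapters: int,
                        rank: int,
                        overhead_gb: float = 2.0,
                        kv_gb_per_seq: float = 0.5) -> int:
    """How many 2k-context sequences fit after base weights, overhead, and adapters."""
    # ~50 MB at r=16 all-attention on a 7B; adapter size scales linearly with rank.
    adapter_gb = n_adapters * (rank / 16) * 0.05
    kv_pool_gb = gpu_gb - base_gb - overhead_gb - adapter_gb
    return int(kv_pool_gb / kv_gb_per_seq)

print(max_concurrent_seqs(80, 14, n_adapters=20, rank=16))    # ~126
print(max_concurrent_seqs(80, 14, n_adapters=20, rank=128))   # ~112
```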
The vLLM knobs
vLLM exposes four flags that govern multi-LoRA behavior. Know what the defaults are and what each one reserves:
| Flag | Default | What it does |
|---|---|---|
| --enable-lora | off | Turn the LoRA path on. Must be set to serve any adapters. |
| --max-loras | 1 | Max distinct adapters that can appear in a single batch. Reserves SGMV workspace for this many segments. Bump this up if you see adapter contention. |
| --max-lora-rank | 16 | Max rank across all loaded adapters. Workspace is sized to this, so every adapter pays the cost of the largest one. Set it to the actual max rank in your fleet — not higher. |
| --max-cpu-loras | max_num_seqs | Max adapters kept in CPU RAM (the second tier from step 4). Defaults to max_num_seqs, which is usually fine; raise it if your working set is bigger than your batch size. |
Two non-obvious traps baked into those defaults:
- --max-loras defaults to 1. Until you change it, you effectively have single-adapter serving. The SGMV kernel is live but it only ever runs one segment.
- --max-lora-rank caps the whole fleet at its largest value. A fleet with nineteen r=16 adapters and one r=64 adapter pays r=64 workspace overhead for every request. If that one r=64 is an outlier, consider re-training it smaller or hosting it on a separate pool.
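Here's what those knobs look like in practice with vLLM's offline LoRA API (the model path, adapter names, and adapter directory are placeholders; check the arguments against your vLLM version, since defaults shift between releases):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,       # turn the LoRA path on
    max_loras=4,            # up to 4 distinct adapters per batch (SGMV segments)
    max_lora_rank=16,       # size workspace for the largest rank in the fleet
    max_cpu_loras=32,       # second-tier (CPU RAM) adapter cache
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# Each request names its tenant's adapter; vLLM batches them heterogeneously.
outputs = llm.generate(
    ["Summarize this ticket: ..."],
    sampling,
    lora_request=LoRARequest("tenant_a", 1, "/adapters/tenant_a"),  # name, id, path
)
print(outputs[0].outputs[0].text)
```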
QLoRA composes cleanly
QLoRA — Quantized LoRA (Dettmers et al., 2023) — runs the base model in 4-bit NF4 (from the Quantization module) and keeps LoRA adapters in FP16. That combination composes cleanly on a serving GPU because it solves two orthogonal cost axes:
- Quantizing the base attacks the "14 GB base model is too big" problem. 4-bit NF4 drops the 7B base from 14 GB to ~4 GB.
- LoRA attacks the "N tenants × N fine-tunes is too expensive" problem. Each per-tenant delta stays tiny regardless of base precision.
You can run a 70B base in NF4 (~35 GB) plus a fleet of FP16 LoRA adapters on a single 80 GB GPU — a deployment that would be impossible with either technique alone. Both vLLM and LoRAX support this composition out of the box.
TP for LoRA — one-liner
For tensor-parallel serving across multiple GPUs, S-LoRA's strategy is: shard B along its output dimension, replicate A on every rank. A is tiny (r × d), so replicating it is cheap. B is big and its output-dim sharding matches the same sharding already used for the base weights, so the all-reduce pattern lines up. Multi-node multi-LoRA is rare in practice, but this is the pattern when you need it.
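A toy sketch of that layout (pure PyTorch on one device, ignoring communication; it just checks that sharding B column-wise while replicating A reproduces the unsharded result):

```python
import torch

d, r, tp = 4096, 16, 4                 # hidden size, rank, tensor-parallel degree
x = torch.randn(2, d)                  # a couple of token vectors
A_t = torch.randn(d, r) * 0.01         # A, replicated on every rank (tiny: d x r)
B_t = torch.randn(r, d) * 0.01         # B, sharded along its output dimension

# Each rank holds one column slice of B (matching the base weight's
# column-parallel sharding) plus a full copy of A.
B_shards = torch.chunk(B_t, tp, dim=1)                     # tp slices of [r, d/tp]

v = x @ A_t                                                # [2, r], identical on every rank
partial_outputs = [v @ B_shard for B_shard in B_shards]    # each [2, d/tp]
y_tp = torch.cat(partial_outputs, dim=1)                   # gather along the output dim

assert torch.allclose(y_tp, (x @ A_t) @ B_t, atol=1e-5)    # matches the unsharded path
```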
Next: put all of this into five short diagnostic questions, then on to TensorRT-LLM.