Multi-LoRA Serving
The Cost Crisis
What is multi-LoRA serving?
Multi-LoRA serving means running many LoRA adapters — small fine-tuning deltas — on top of a single shared base model, swapping the active adapter per request. One 14 GB base model can host hundreds of fine-tunes at roughly the cost of one, because each adapter is only tens of megabytes. The rest of this module covers the engineering the serving stack has to do to make those economics actually work.
A LoRA adapter (Low-Rank Adaptation, Hu et al. 2021) is the small weight delta produced by fine-tuning with LoRA instead of full fine-tuning. Think of it as a tiny "personality patch" you clip onto the base model at inference time. A full 7B fine-tune is 14 GB. A LoRA adapter that gives you the same behavior change is typically around 50 MB — roughly 280x smaller.
280× smaller · same behavior change
The tenant cost math
Imagine a SaaS product with N customer tenants, each wanting their own fine-tuned 7B model.
- Full fine-tunes: N separate 14 GB models. On an 80 GB A100 you can fit about 5 before you need a second GPU. 100 tenants is 20 GPUs of base-model weight. 1,000 tenants is 200 GPUs. The cost grows linearly and it grows fast.
- Base + adapters: one 14 GB base once, plus N tiny adapters at ~50 MB each. 100 tenants adds 5 GB of adapters — still one GPU. 1,000 tenants adds 50 GB of adapters — still well within a single 80 GB GPU once you factor in tiered caching (step 4).
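A quick back-of-the-envelope script makes the divergence concrete. This is a sketch of the arithmetic above, not a capacity planner: it uses the same 14 GB / ~50 MB / 80 GB assumptions as this module and counts weights only (no KV cache, no activations).

```python
# Back-of-the-envelope GPU count: full fine-tunes vs. base + adapters.
import math

BASE_GB = 14.0       # one 7B model in FP16
ADAPTER_GB = 0.05    # ~50 MB LoRA adapter
GPU_GB = 80.0        # A100 80 GB

def gpus_full_finetunes(tenants: int) -> int:
    models_per_gpu = int(GPU_GB // BASE_GB)        # ~5 full models per GPU
    return math.ceil(tenants / models_per_gpu)

def gpus_base_plus_adapters(tenants: int) -> int:
    total_gb = BASE_GB + tenants * ADAPTER_GB      # one base + N tiny adapters
    return math.ceil(total_gb / GPU_GB)

for n in (1, 10, 100, 1_000):
    print(f"{n:>5} tenants: {gpus_full_finetunes(n):>4} GPUs (full fine-tunes) "
          f"vs {gpus_base_plus_adapters(n)} GPU(s) (base + adapters)")
```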
This is the core pitch, famously summarized as "hundreds of fine-tuned LLMs for the cost of one" (Predibase/LoRAX). Click the tenant-scale presets on the right (1 / 10 / 100 / 1,000) and watch the two bars diverge. The top bar grows linearly; the bottom bar is almost flat.
The merge trick is wrong for serving
If you've seen LoRA before, you may have seen the "merge" trick: compute W' = W + BA once, replace W with W', and serve the merged model as if it were a regular fine-tune. That works fine if you only ever serve one adapter. It is fatal for multi-tenant serving.
The moment you mutate W into W', the base weights are gone — you can't serve tenant B's adapter off the same GPU without re-loading 14 GB of base. Multi-tenancy requires keeping W pristine and computing the adapter contribution separately at every forward pass:
y = Wx + (BA)x
Every production multi-LoRA system (S-LoRA, Punica, LoRAX, vLLM's LoRA path, TensorRT-LLM's LoRA cache) keeps Wx and (BA)x as two distinct computations and adds the results. The whole rest of this module — SGMV kernels, unified paging, rank tradeoffs — exists because we refuse to merge.
The merged form y = (W + BA)x is what trainers write. The un-merged form y = Wx + (BA)x is what servers run. Same math, but the un-merged form is the only way one GPU can serve many tenants.
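Here is the difference in code. A minimal PyTorch sketch (the tensor names and tenant dictionary are illustrative, not from any particular serving engine): the merged path mutates W and locks the copy to one tenant, while the un-merged path keeps W frozen and switches adapters per request.

```python
import torch

d, r = 4096, 16
W = torch.randn(d, d) * 0.02          # frozen base weight, shared by all tenants
adapters = {                           # per-tenant low-rank pairs (B, A)
    "tenant_a": (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
    "tenant_b": (torch.randn(d, r) * 0.01, torch.randn(r, d) * 0.01),
}

def merged_forward(x, B, A):
    # Trainer's view: fold the delta into W once. Fine for a single adapter,
    # but this copy of W is now tenant-specific -- no other adapter can share it.
    W_prime = W + B @ A
    return x @ W_prime.T

def unmerged_forward(x, tenant):
    # Server's view: W stays pristine; the adapter contribution is a separate
    # pair of cheap matmuls added on top. Every tenant rides the same W.
    B, A = adapters[tenant]
    return x @ W.T + (x @ A.T) @ B.T   # Wx + (BA)x, factored through rank r

x = torch.randn(8, d)                  # a batch of 8 token vectors
y_a = unmerged_forward(x, "tenant_a")
y_b = unmerged_forward(x, "tenant_b")  # same GPU, same W, different adapter
```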
Next: we look at what B and A actually are, how big they get, and why the rank r is the knob that controls everything downstream.
LoRA Math at Serving Time
How does LoRA work at inference time?
Every adapted linear layer computes:
y = Wx + (BA)x
- W is the frozen base weight, shape d_out × d_in. For a Llama-7B attention projection, d_in = d_out = 4096.
- B is trainable, shape d_out × r.
- A is trainable, shape r × d_in.
- r is the rank — a small integer, typically 8 to 64.
The insight from the original LoRA paper is empirical: the delta produced by fine-tuning is approximately low-rank. You don't need a full d_out × d_in update matrix — a rank-r factorization BA recovers most of the quality at a fraction of the parameter count.
Why it's tiny. W has d × d = 16.7M params per projection. B and A together have 2 × r × d params. At r = 16, that's 131K params — about 0.8% of W. Across a 7B model, well under 1% of total parameters are trainable.
Drag the r slider on the right (4 to 128) and watch the B and A strips grow along their narrow axis. The big W box never moves — that's the whole point.
Adapter size formula
For an FP16 adapter over m target modules per layer (all with dimension d), across L transformer layers:
adapter_size = 2 × r × d × m × L × 2 bytes
= 4 × r × d × m × L bytes
The 2 × r × d counts params in B + A per module; × m × L counts every adapted module in the network; the trailing × 2 is FP16 (2 bytes per param).
Worked example. Llama-7B, d = 4096, r = 16, 4 target modules per layer (q_proj, k_proj, v_proj, o_proj), 32 layers:
4 × 16 × 4096 = 262,144 bytes per module
× 4 modules × 32 layers = 33,554,432 bytes
≈ 32 MB just for attention projections
With LoRA scaling factors and overhead, real HuggingFace uploads for "all-attention at r=16 on 7B" land around 45–50 MB. Real filenames you can go download:
- tloen/alpaca-lora-7b/adapter_model.safetensors — ~17 MB (r=8, q + v only)
- timdettmers/qlora-alpaca-7b/adapter_model.bin — ~83 MB (r=64, all-attention, the QLoRA paper weights)
- huggyllama/llama-7b-lora/adapter_model.safetensors — ~97 MB (r=64, all-attention)
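The size formula is easy to script. A small sketch (the helper name and module shapes are illustrative; real uploads also carry config metadata and sometimes extra saved modules, so treat the output as a lower bound):

```python
def lora_adapter_bytes(rank: int,
                       dims: list[tuple[int, int]],
                       n_layers: int,
                       bytes_per_param: int = 2) -> int:
    """Size of a LoRA adapter: one (B, A) pair per target module per layer.

    dims            -- (d_out, d_in) of each adapted linear layer within one layer
    bytes_per_param -- 2 for FP16/BF16 adapters
    """
    params_per_layer = sum(rank * (d_out + d_in) for d_out, d_in in dims)
    return params_per_layer * n_layers * bytes_per_param

# Llama-7B shapes: attention projections are 4096x4096,
# MLP gate/up are 11008x4096 and down is 4096x11008.
attn = [(4096, 4096)] * 4                              # q, k, v, o
mlp = [(11008, 4096), (11008, 4096), (4096, 11008)]    # gate, up, down

for name, rank, dims in [("minimal (r=8, q+v only)", 8, attn[:2]),
                         ("typical (r=16, all-attention)", 16, attn),
                         ("heavy (r=64, all-linear)", 64, attn + mlp)]:
    mb = lora_adapter_bytes(rank, dims, n_layers=32) / 2**20
    print(f"{name:32s} ~{mb:.0f} MB")
```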
Toggle the target-module buttons on the right (q, v only / all-attention / all-linear) to see the count change. Each added module adds its own B/A pair.
Typical sizes across the fleet
| Config | Rank | Target modules | Adapter size (7B) |
|---|---|---|---|
| Minimal | r=8 | q_proj, v_proj | ~8 MB |
| Typical | r=16 | all-attention (q, k, v, o) | ~50 MB |
| Heavy | r=64 | all-linear (+ MLP gate/up/down) | ~300 MB |
The choice of target modules — which linear layers you actually adapt — is the second knob alongside rank. q, v only is the original LoRA paper recommendation and the cheapest setting. all-attention is the modern default and the sweet spot for most tasks. all-linear (attention + MLP) gives the highest quality but 4–6x larger adapters. The right-panel toggle lets you see the size jump for each choice.
Why this matters for serving
Every number in this module flows from the adapter-size formula. Step 3 batches adapters across requests, so "adapter size" sets how much weight data a kernel pulls per segment. Step 4 caches adapters across tiers, so "adapter size" sets your working-set footprint. Step 5 trades adapter pool against KV cache, so "rank times count" is the memory the serving engine reserves up front.
Keep the slider in mind: at r = 16 you're spending ~50 MB per tenant. At r = 128 you're spending ~400 MB per tenant. That 8x is not free, and the next three steps show exactly where it shows up.
Heterogeneous Batches and SGMV
Quick recap — what's in an adapter
From step 2: every LoRA adapter is a pair of small matrices B (d_out × r) and A (r × d_in), applied as (BA)x and added to the frozen base result Wx. r is tiny (typically 16). That's the payload each request brings to the forward pass.
The heterogeneous batch problem
Recall from the Batching module: throughput on modern GPUs comes from batching — stacking many requests' tokens into one big matmul so the base weight read W gets amortized across all of them. 7B base = 14 GB weight read per decode step. 1 request alone = 14 GB for 1 result. 50 requests batched = same 14 GB, 50 results. That's the entire reason continuous batching exists.
Batching works beautifully when every request uses the same weights. Multi-LoRA breaks that assumption: request A uses adapter α, request B uses adapter β, request C uses adapter γ — each one has a different (BA) pair. The base matmul Wx still batches fine (same W for everyone), but the adapter contribution (BA)x seems to need a separate kernel per request.
Naive path: batch size 32 × 32 unique adapters = 32 separate kernel launches per layer, just for the LoRA contribution. On an 80-layer 70B model, that's 2,560 extra kernel launches per token — exactly the launch-overhead disaster the CUDA Graphs module just got you to care about.
What is SGMV (Segmented Gather Matrix-Vector)?
SGMV is the kernel that fixes this. Introduced in the Punica paper (Chen et al., 2023), it's a CUDA kernel that handles a mixed-adapter batch in a small number of launches — one per distinct adapter in the batch, not one per request.
The trick has two parts.
Part 1 — decompose the LoRA path as two matmuls. The raw math is y = x · A · B per request (row-vector convention, with A stored d_in × r and B stored r × d_out — the transpose of the step-2 layout). Written as one fused op, the shapes fight each other because each request's A and B are different. Punica's move: split it into two ops with an intermediate of shape r:
v = x · A # shape [batch, r] — apply A
Δy = v · B # shape [batch, d] — apply B
y = Wx + Δy # add to the base result
The intermediate v has batch × r floats — cheap to materialize. Now each matmul only has one adapter-dependent matrix on one side, which is exactly the shape a grouped GEMM can handle.
Part 2 — sort by adapter, run one grouped GEMM per segment. Reorder the batch so all requests using the same adapter sit next to each other. You get a handful of contiguous segments, each one uniform. Fire one kernel per segment that processes every row in that segment against that segment's A (then again for B). 32 requests using 4 unique adapters collapse from 32 launches to 4 launches per projection — one per segment.
This is the "Segmented" part of SGMV. "Gather" refers to the memory access pattern inside the kernel: each thread-block gathers its segment's rows from the flattened batch. "Matrix-Vector" because during decode each request contributes a single token — a vector — per segment.
Pick a preset on the right (all same / 2 popular + 1 rare / all different / realistic mix) or click any request cell to cycle its adapter color. Watch the top row (naive: one lightning bolt per cell) vs the bottom row (SGMV: one lightning bolt per segment). The counter at the bottom says it cleanly: naive N launches → SGMV K launches.
MBGMM vs MBGMV — prefill and decode need different kernels
Recall from the Prefill/Decode Disaggregation module: prefill processes many tokens at once (the whole prompt), while decode processes one token at a time. Same model, same adapters, but the math shape is different.
S-LoRA (the follow-up to Punica) ships two SGMV variants, one per regime:
- MBGMM — Multi-size Batched Gather Matrix-Matrix multiplication. The per-request input is prompt_len × d, a matrix. The kernel does batched matrix × matrix per segment. Used for prefill.
- MBGMV — Multi-size Batched Gather Matrix-Vector multiplication. The per-request input is 1 × d, a vector. The kernel does batched matrix × vector per segment. Used for decode.
The names are unglamorous but they carry the whole lesson: heterogeneous batching needs a kernel that takes the whole-segment shape into account, and prefill (compute-bound, large tiles) and decode (memory-bound, tiny tiles) want different tile sizes. vLLM, S-LoRA, and LoRAX all ship both.
The numbers
Punica (the original SGMV paper): up to 12x throughput vs naive multi-LoRA serving, with +2 ms/token overhead vs base-only (no LoRA at all). The overhead is the cost of that extra matmul pair.
S-LoRA (Sheng et al., 2023): extends Punica to 2,000 concurrent adapters on one GPU with only a ~5% throughput drop vs serving 5 adapters. Benchmarked at 4x faster than vLLM (pre-LoRA integration) and 30x faster than HuggingFace PEFT on multi-adapter workloads.
The headline: the marginal cost of the 2,000th adapter is almost zero, as long as SGMV is doing the batching.
If you only remember one thing from this step: sort by adapter, then run one grouped GEMM per segment. Every multi-LoRA kernel (Punica's SGMV, S-LoRA's MBGMM/MBGMV, vLLM's bgmv_*, LoRAX's Punica port) is a variation on that idea.
Next: where do those adapters physically live, and what happens when the working set exceeds GPU memory?
Unified Paging: Where Does Your Adapter Live?
How does multi-LoRA memory management work?
An adapter fleet can be huge — thousands of tenants, each with their own ~50 MB adapter. 2,000 adapters at 50 MB each is 100 GB of weight data, far more than a single GPU can hold. But at any instant, only a handful are actively serving requests. The rest are cold. So the serving system uses a 3-tier cache to keep hot adapters close and cold adapters cheap.
| Tier | Where it lives | Fetch latency | Typical capacity |
|---|---|---|---|
| GPU HBM | On the GPU, ready to serve | ~1 ms (already there) | tens of adapters |
| CPU RAM | Host memory, one PCIe hop away | ~10 ms promote to GPU | hundreds to thousands |
| Disk / object store | NVMe SSD or S3 | 100–200 ms cold load | unbounded |
A request arriving for adapter X does a cache lookup: if X is on GPU, serve immediately (~1 ms). If X is on CPU, promote it to GPU over PCIe (~10 ms), possibly evicting a cold adapter to make room. If X is on disk, load it into CPU first (~100–200 ms), then promote to GPU. That disk fetch is where cold-adapter TTFT spikes come from — more on that in step 6.
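A sketch of that lookup path (illustrative Python with made-up latency constants matching the table above; production loaders do the promotion on a background thread so a cold fetch doesn't block other requests):

```python
from collections import OrderedDict

GPU_SLOTS = 8            # how many adapters fit in the GPU pool
FETCH_MS = {"gpu": 1, "cpu": 10, "disk": 150}   # rough tier latencies from the table

gpu = OrderedDict()      # adapter_id -> weights, LRU order (most recent last)
cpu = OrderedDict()      # second tier: host RAM

def load_from_disk(adapter_id):
    return f"<weights of {adapter_id}>"          # stand-in for a safetensors load

def get_adapter(adapter_id):
    """Return (weights, latency_ms), promoting across tiers as needed."""
    if adapter_id in gpu:                        # hot: already resident
        gpu.move_to_end(adapter_id)
        return gpu[adapter_id], FETCH_MS["gpu"]

    if adapter_id in cpu:                        # warm: one PCIe hop away
        weights, latency = cpu.pop(adapter_id), FETCH_MS["cpu"]
    else:                                        # cold: disk -> CPU -> GPU
        weights, latency = load_from_disk(adapter_id), FETCH_MS["disk"]

    if len(gpu) >= GPU_SLOTS:                    # evict the coldest GPU adapter to CPU
        evicted_id, evicted_weights = gpu.popitem(last=False)
        cpu[evicted_id] = evicted_weights
    gpu[adapter_id] = weights
    return weights, latency
```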
Hit ▶ Play on the right. Try the three traffic pattern toggles:
- hot-3 (Zipfian — 80% of traffic on 3 adapters): GPU hit rate stays high, avg latency stays near 1 ms.
- uniform (equal traffic across 10 adapters): GPU hits depend on slot count; CPU hits fill the gap.
- churning (expanding working set): hits migrate to disk as the working set outgrows GPU and CPU. Watch the avg-latency number climb.
Adjust GPU slots with the + / − buttons to see capacity pressure directly.
PagedAttention, applied to weights
Recall from the KV Cache and Inference Engine modules: PagedAttention treats the KV cache as a pool of fixed-size pages with an allocation table. Instead of reserving one giant contiguous buffer per sequence (most of which sits unused), the engine carves the KV cache into small pages (e.g. 16 tokens each). Each sequence's block_table records which physical pages hold its KV entries, and the attention kernel follows the table to gather the right data. When a sequence ends, its pages go back to the free pool. When a new sequence arrives, pages are allocated on demand. If the pool is full, LRU eviction frees the coldest pages first. The whole point is turning a rigid per-sequence reservation into a flexible shared pool.
Multi-LoRA applies the same idea to adapter weights. The GPU adapter tier is a pool of fixed-size slots (one slot per adapter-worth of weights). A lookup table maps adapter_id → slot_index. When a new adapter is requested and the pool is full, the coldest adapter is evicted back to CPU RAM. Each adapter takes up one slot; same allocation, eviction, and LRU dance, just with bigger pages.
This parallel is why the framing is called unified paging: the same pool-and-eviction machinery can (and should) manage both KV pages and adapter slots out of one memory budget — which brings us to the production gotcha.
LoRAX and TensorRT-LLM
LoRAX (from Predibase) names the 3-tier model explicitly in its docs: adapters live on GPU, CPU, or "dynamic load from disk/Hub." Its eviction policy is LRU, and its loader is threaded so a cold fetch doesn't block the request path.
TensorRT-LLM ships a 2-level LoRA cache — GPU + CPU, no separate disk tier (you pre-register all adapters at startup). It's fast where it runs: a hot adapter swap between GPU slots is measured at 1–2 ms, including the PCIe copy. The tradeoff is that the fleet has to be known up front.
The unified-paging gotcha
Multi-LoRA has a subtle memory bug that doesn't exist in single-adapter serving. Here's the setup:
- An in-flight sequence has KV cache entries pinned to the GPU.
- That sequence was using adapter X.
- Adapter X gets evicted (say, because a burst of traffic for adapters Y and Z flooded the pool).
- The sequence is still running. Its KV entries are still there. But the adapter those entries were computed against is now... somewhere else.
If the serving engine statically partitions GPU memory into a fixed KV region and a fixed adapter region, those two regions evict independently. Adapter X leaves; the KV entries computed against it stay. They still occupy memory but can't safely be used by any scheduled computation. ExpertWeave (Aug 2025) measured this directly: under a static KV/adapter partition in vLLM-style serving, up to 46.5% of KV cache entries end up invalid — pinned memory with no live adapter to pair them with.
Unified paging means one pool, managed together. KV pages and adapter slots coexist in the same memory budget, and the eviction policy knows that evicting adapter X invalidates the KV pages of sequences using X — so it either evicts those pages too, or refuses to evict X. A static 70% KV / 30% adapter split cannot do this. ExpertWeave's fix, and the direction the field is moving, is to merge the two pools under one policy.
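One way to express the "refuses to evict X" half of that policy is reference counting: the pool never evicts an adapter that a live sequence's KV pages still depend on. A minimal sketch of that constraint (illustrative class, not any engine's actual code):

```python
from collections import OrderedDict

class UnifiedAdapterPool:
    """Adapter slots share one budget with KV pages; eviction respects live sequences."""

    def __init__(self, slots: int):
        self.slots = slots
        self.resident = OrderedDict()   # adapter_id -> weights, LRU order
        self.refcount = {}              # adapter_id -> number of in-flight sequences

    def acquire(self, adapter_id, weights_loader):
        """Called when a sequence starts; pins the adapter for the sequence's lifetime."""
        if adapter_id not in self.resident:
            self._make_room()
            self.resident[adapter_id] = weights_loader(adapter_id)
        self.resident.move_to_end(adapter_id)
        self.refcount[adapter_id] = self.refcount.get(adapter_id, 0) + 1
        return self.resident[adapter_id]

    def release(self, adapter_id):
        """Called when a sequence finishes; its KV pages are freed, so the pin drops."""
        self.refcount[adapter_id] -= 1

    def _make_room(self):
        if len(self.resident) < self.slots:
            return
        # Evict the coldest adapter with no live sequences. Evicting a pinned
        # adapter would strand that sequence's KV pages -- the invalid-KV bug above.
        for adapter_id in list(self.resident):
            if self.refcount.get(adapter_id, 0) == 0:
                del self.resident[adapter_id]
                return
        raise RuntimeError("all resident adapters are pinned by in-flight sequences")
```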
This is the final layer of the paging argument: it's not enough to page adapters like PagedAttention pages KV — you have to page them in the same pool so the eviction policy can keep the pairing consistent.
Next: rank and KV are fighting for the same memory. How much does each cost, and what knobs does vLLM give you?
Rank Tradeoff: Adapters Steal from KV
What does rank cost you?
Every byte of GPU memory that holds a LoRA adapter is a byte that can't hold a KV cache entry. Multi-LoRA doesn't create memory — it redistributes it. Understanding that tradeoff in concrete terms is how you size a production deployment without surprises.
Here's the mental model in one sentence: every GB you spend on adapters is ~2 fewer concurrent users at 2k context on a 7B model.
That "~2 users per GB" is a 7B-at-2k-context rule of thumb from the KV Cache module: one sequence's KV entries run about 0.5 GB at 2k tokens, so each GB of memory is roughly two concurrent sequences. Spend a GB on adapters and you just kicked two users out of the batch. Spend 10 GB on adapters and you lose ~20 users.
Pick a rank pill on the right (r = 8, 16, 32, 64, 128) then click + add adapter to grow the fleet. The pink bar (adapter pool) grows; the indigo bar (KV cache) shrinks 1:1. The big number below — max concurrent sequences — is what that tradeoff costs you.
Worked example
Let's ground this in real numbers. 7B model on an A100 80 GB:
- Base model: 14 GB (FP16 weights).
- Runtime overhead (activations, workspace, CUDA context): ~2 GB.
- Pool available for KV + adapters: ~64 GB.
Now load 20 adapters at r = 16 all-attention:
20 adapters × 50 MB = 1,000 MB ≈ 1 GB for adapters
64 GB pool − 1 GB = 63 GB left for KV
63 GB / 0.5 GB per seq ≈ 126 concurrent sequences
Bump those same 20 adapters up to r = 128 all-attention:
20 adapters × 400 MB = 8 GB for adapters
64 − 8 = 56 GB left for KV
56 / 0.5 ≈ 112 concurrent sequences
Dropping from 126 to 112 concurrent sequences is an 11% throughput cut for the rank bump, on a 7B. On a bigger model, or with more adapters, or both, the cut gets steeper fast. The simulator on the right reproduces this math in real time — the big counter should match what you'd work out by hand.
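The same arithmetic as a small script, using this module's rule-of-thumb constants (14 GB base, ~2 GB runtime overhead, 0.5 GB of KV per 2k-context sequence, ~50 MB per r=16 adapter); swap in your own model's numbers for real capacity planning:

```python
def max_concurrent_seqs(gpu_gb: float,
                        base_gb: float,
                        n_adapters: int,
                        rank: int,
                        overhead_gb: float = 2.0,
                        kv_gb_per_seq: float = 0.5) -> int:
    """How many 2k-context sequences fit after base weights, overhead, and adapters."""
    # ~50 MB at r=16 all-attention on a 7B; adapter size scales linearly with rank.
    adapter_gb = n_adapters * (rank / 16) * 0.05
    kv_pool_gb = gpu_gb - base_gb - overhead_gb - adapter_gb
    return int(kv_pool_gb / kv_gb_per_seq)

print(max_concurrent_seqs(80, 14, n_adapters=20, rank=16))    # ~126
print(max_concurrent_seqs(80, 14, n_adapters=20, rank=128))   # ~112
```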
The vLLM knobs
vLLM exposes four flags that govern multi-LoRA behavior. Know what the defaults are and what each one reserves:
| Flag | Default | What it does |
|---|---|---|
| --enable-lora | off | Turn the LoRA path on. Must be set to serve any adapters. |
| --max-loras | 1 | Max distinct adapters that can appear in a single batch. Reserves SGMV workspace for this many segments. Bump this up if you see adapter contention. |
| --max-lora-rank | 16 | Max rank across all loaded adapters. Workspace is sized to this, so every adapter pays the cost of the largest one. Set it to the actual max rank in your fleet — not higher. |
| --max-cpu-loras | max_num_seqs | Max adapters kept in CPU RAM (the second tier from step 4). Defaults to max_num_seqs, which is usually fine; raise it if your working set is bigger than your batch size. |
Two non-obvious traps baked into those defaults:
- --max-loras defaults to 1. Until you change it, you effectively have single-adapter serving. The SGMV kernel is live but it only ever runs one segment.
- --max-lora-rank caps the whole fleet at its largest value. A fleet with nineteen r=16 adapters and one r=64 adapter pays r=64 workspace overhead for every request. If that one r=64 is an outlier, consider re-training it smaller or hosting it on a separate pool.
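Here's what those knobs look like in practice with vLLM's offline LoRA API (the model path, adapter names, and adapter directory are placeholders; check the arguments against your vLLM version, since defaults shift between releases):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,       # turn the LoRA path on
    max_loras=4,            # up to 4 distinct adapters per batch (SGMV segments)
    max_lora_rank=16,       # size workspace for the largest rank in the fleet
    max_cpu_loras=32,       # second-tier (CPU RAM) adapter cache
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# Each request names its tenant's adapter; vLLM batches them heterogeneously.
outputs = llm.generate(
    ["Summarize this ticket: ..."],
    sampling,
    lora_request=LoRARequest("tenant_a", 1, "/adapters/tenant_a"),  # name, id, path
)
print(outputs[0].outputs[0].text)
```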
QLoRA composes cleanly
QLoRA — Quantized LoRA (Dettmers et al., 2023) — runs the base model in 4-bit NF4 (from the Quantization module) and keeps LoRA adapters in FP16. That combination composes cleanly on a serving GPU because it solves two orthogonal cost axes:
- Quantizing the base attacks the "14 GB base model is too big" problem. 4-bit NF4 drops the 7B base from 14 GB to ~4 GB.
- LoRA attacks the "N tenants × N fine-tunes is too expensive" problem. Each per-tenant delta stays tiny regardless of base precision.
You can run a 70B base in NF4 (~35 GB) plus a fleet of FP16 LoRA adapters on a single 80 GB GPU — a deployment that would be impossible with either technique alone. Both vLLM and LoRAX support this composition out of the box.
TP for LoRA — one-liner
For tensor-parallel serving across multiple GPUs, S-LoRA's strategy is: shard B along its output dimension, replicate A on every rank. A is tiny (r × d), so replicating it is cheap. B is big and its output-dim sharding matches the same sharding already used for the base weights, so the all-reduce pattern lines up. Multi-node multi-LoRA is rare in practice, but this is the pattern when you need it.
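A toy sketch of that layout (pure PyTorch on one device, ignoring communication; it just checks that sharding B column-wise while replicating A reproduces the unsharded result):

```python
import torch

d, r, tp = 4096, 16, 4                 # hidden size, rank, tensor-parallel degree
x = torch.randn(2, d)                  # a couple of token vectors
A_t = torch.randn(d, r) * 0.01         # A, replicated on every rank (tiny: d x r)
B_t = torch.randn(r, d) * 0.01         # B, sharded along its output dimension

# Each rank holds one column slice of B (matching the base weight's
# column-parallel sharding) plus a full copy of A.
B_shards = torch.chunk(B_t, tp, dim=1)                     # tp slices of [r, d/tp]

v = x @ A_t                                                # [2, r], identical on every rank
partial_outputs = [v @ B_shard for B_shard in B_shards]    # each [2, d/tp]
y_tp = torch.cat(partial_outputs, dim=1)                   # gather along the output dim

assert torch.allclose(y_tp, (x @ A_t) @ B_t, atol=1e-5)    # matches the unsharded path
```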
Next: put all of this into five short diagnostic questions, then on to TensorRT-LLM.