PreFT — Prefill-only LoRA adapters
The news. On May 16, 2026, Stanford researchers (Lanpouthakoun, Arora, Wu, Pai, Keigwin, Jurafsky, Potts) shared PreFT (the paper had been posted to arXiv two days earlier) alongside reference implementations for LoRA and ReFT on vLLM. The headline result is 1.9× throughput on Llama 3.1 70B serving 512 concurrent adapters, with RL-trained adapters approaching parity and supervised-fine-tuning adapters needing a higher rank to close the gap. The authors' quote on the method is direct: “We only apply the adapter to prefill tokens and discard it afterwards.”
Picture the museum at opening time. A tour guide stands at the entrance and briefs each arriving group — “notice the Vermeer on the right, the Rothko on the second floor.” Once the briefing lands, the guide steps off the floor. The visitors carry the briefing in their heads as they wander the galleries one room at a time. The alternative is the guide walking each visitor through every single room, repeating the same talking points hundreds of times — the rooms become a bottleneck not because they are slow but because the guide is overloaded.
A standard multi-LoRA server is the second museum. For every request — and every decode step inside that request — the engine looks up the request's adapter, folds the low-rank B·A delta into each layer's output, and only then emits the next token. With 512 concurrent adapters the kernel has to dispatch a different B·A per row of the batch (SGMV) on every step. That cost is cheap in the abstract — a low-rank matmul is small compared to the base model — but multiplied by hundreds of decode steps per request times 512 concurrent rows, the adapter overhead eats the serving budget.
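The per-row dispatch described above can be sketched in a few lines of NumPy. This is a minimal illustration of the arithmetic only, not the SGMV kernel or vLLM's implementation: the function name `multi_lora_linear`, the argument layout, and the gather-then-einsum approach are all assumptions made for this example.

```python
import numpy as np

def multi_lora_linear(x, W, A, B, adapter_ids):
    """Shared base projection plus a different low-rank B.A delta per batch row.

    Hypothetical shapes, chosen for this sketch:
      x:           (batch, d_in)        one token state per request in the batch
      W:           (d_out, d_in)        base weight shared by every request
      A:           (n_adapters, r, d_in)
      B:           (n_adapters, d_out, r)
      adapter_ids: (batch,)             which adapter each row uses
    """
    base = x @ W.T                                   # (batch, d_out), same for all adapters
    A_rows = A[adapter_ids]                          # gather: (batch, r, d_in)
    B_rows = B[adapter_ids]                          # gather: (batch, d_out, r)
    low = np.einsum("bri,bi->br", A_rows, x)         # project each row down to rank r
    delta = np.einsum("bor,br->bo", B_rows, low)     # project each row back up to d_out
    return base + delta
```

In the standard regime a call like this runs in every adapted layer on every decode step, which is the repeated cost the paragraph above describes.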
PreFT is the first museum. During prefill, the adapter does its full job: each layer's activations get the B·A delta, so the KV cache produced during prefill is already shaped by the adapter. At the prefill / decode boundary, the engine throws the adapter away. Every subsequent decode step is just the unmodified base model attending to KV pairs the adapter already “briefed”. The behavioural fingerprint of the fine-tune is encoded into the cached keys and values that every later decode step reads from anyway — the model never has to recompute the adapter's contribution because it's already baked into what the cache remembers.
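A toy version of this prefill / decode split may help make the mechanism concrete. The sketch below assumes a single attention layer with one head and LoRA deltas on the key and value projections only; every name in it (`kv_proj`, `prefill`, `decode_step`, the `lora` dictionary keys) is hypothetical and not taken from the paper's code or from vLLM. In a real multi-layer model the adapter would also shape the hidden states that feed later layers during prefill.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kv_proj(x, Wk, Wv, lora=None):
    """Project token states (n_tokens, d) to keys and values, optionally adding LoRA deltas."""
    if lora is not None:
        Wk = Wk + lora["Bk"] @ lora["Ak"]   # low-rank update to the key projection
        Wv = Wv + lora["Bv"] @ lora["Av"]   # low-rank update to the value projection
    return x @ Wk.T, x @ Wv.T

def prefill(prompt, Wk, Wv, lora):
    """Prefill with the adapter active: the cached K and V carry its contribution."""
    return kv_proj(prompt, Wk, Wv, lora)

def decode_step(x_t, K, V, Wq, Wk, Wv):
    """One decode step with base weights only; the adapter is no longer applied."""
    k_t, v_t = kv_proj(x_t[None, :], Wk, Wv)      # new token's K/V come from the base model
    K = np.vstack([K, k_t])
    V = np.vstack([V, v_t])
    q = Wq @ x_t
    attn = softmax(K @ q / np.sqrt(q.size))       # attend over adapter-shaped prompt K/V
    return attn @ V, K, V
```

The decode step never touches the `lora` weights, yet its output still depends on them through the cached `K` and `V` rows written during prefill.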
The hero animation shows this side by side. Both lanes start identically: input prompt enters, the model runs prefill with the 8 visible adapter chips active (representing 8 of 512 in the batch), and the KV cache fills up with blue bars over the prefill window. At the dashed boundary, the adapter chips on the PreFT lane fade out and the boundary line glows green. From there on, both lanes emit output tokens — but the PreFT lane emits roughly two for every one the Standard lane produces in the same elapsed time, because the Standard lane is still paying the adapter cost on every decode step. The throughput meters at the right of each lane settle at the 1.9× ratio the paper reports.
A worked example with named numbers shows where PreFT earns its keep (illustrative; exact numbers depend on adapter rank, batch composition, and the engine). Picture a serving slice with N=512 active adapters in the batch, a prompt of P=1,024 prefill tokens, and a generation length of D=512 decode tokens. In the standard regime the adapter does its low-rank work at P+D = 1,536 token positions per request, on every layer. PreFT keeps the P=1,024 prefill positions but removes the D=512 decode positions. The decode positions are the expensive ones: each decode step is its own sequential forward pass, whereas prefill fuses all P prompt tokens into roughly one parallel pass. On Llama 3.1 70B with 512 concurrent adapters, the paper reports 1.9× throughput vs. the standard multi-LoRA baseline.
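The bookkeeping in this example can be written out as a short calculation. It is a counting sketch only, under the simplifying assumption that prefill is a single parallel pass (chunked prefill would split it into several) and that each decode token is one sequential pass. The function name `adapter_cost` is hypothetical, and the counts say nothing about wall-clock time.

```python
def adapter_cost(prefill_tokens, decode_tokens, prefill_only):
    """Count where the adapter runs for one request.

    Assumption: prefill is one parallel forward pass; each decode token
    is its own sequential forward pass.
    """
    decode_with_adapter = 0 if prefill_only else decode_tokens
    return {
        # token positions at which the low-rank delta is applied (per layer)
        "adapter_token_positions": prefill_tokens + decode_with_adapter,
        # forward passes in which the adapter is in the loop
        "adapter_forward_passes": 1 + decode_with_adapter,
    }

standard = adapter_cost(1024, 512, prefill_only=False)
preft = adapter_cost(1024, 512, prefill_only=True)
print(standard)  # {'adapter_token_positions': 1536, 'adapter_forward_passes': 513}
print(preft)     # {'adapter_token_positions': 1024, 'adapter_forward_passes': 1}
```

Under these assumptions the adapter still touches two thirds of the token positions, but the number of passes that need a per-row adapter dispatch falls from 513 to 1.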
Where PreFT sits next to other adapter-cost levers
| Lever | What it changes | When the adapter runs | Where it earns its keep |
|---|---|---|---|
| Vanilla multi-LoRA | Baseline — every token on every adapter | Prefill and every decode step | simplest path; flexibility to swap adapter mid-stream |
| SGMV-batched decode | Kernel-level efficiency, not adapter-call elimination | Prefill and every decode step (just batched) | flattens the per-row dispatch cost across a mixed batch |
| Unified adapter paging | Adapter weight residency, not active-step cost | Prefill and every decode step (paged in) | fits more adapters in HBM at once |
| PreFT — prefill-only adapter discard (this paper) | Decode-side adapter cost goes to zero | Prefill only — adapter discarded before decode | multi-LoRA workloads dominated by decode; 1.9× throughput at N=512, Llama 3.1 70B |
PreFT slots into the existing multi-LoRA serving stack — the adapter still trains and serves as a normal LoRA or ReFT module, the engine just stops applying it once prefill emits its last KV pair. Because the adapter's contribution lives in the KV cache, in principle the standard cache-side levers — paged attention, prefix caching, chunked prefill — should remain compatible, though the paper itself focuses on the LoRA / ReFT + vLLM result and does not benchmark each combination. There is one new constraint either way: if a request needs the adapter to behave differently mid-stream (e.g. a long-running agent that switches persona at decode step 200), PreFT doesn't support it without a re-prefill, because the adapter is no longer in the loop.
The mental-model shift is that adapter cost is front-loaded into the cache, not paid on every decode step. The paper demonstrates this for LoRA and ReFT specifically; whether other adapter families can be “projected through” the prefilled KV cache the same way is an open question. Multi-LoRA serving was a decode bottleneck for the same reason every memory-bound thing is a decode bottleneck — decode is the long phase. PreFT removes the adapter from that phase entirely.
Related explainers
- HuggingFace — Async continuous batching — another “the GPU is idle on decode-side bookkeeping” bottleneck, attacked at the scheduler instead of the adapter
- vLLM v0.20 — FlashAttention 4 packing — a decode-side throughput win in the same engine PreFT releases on
- PPOW — window-level RL for speculative drafters — a different decode-side throughput lever: vary how many tokens you draft per step, instead of varying whether the adapter runs