PreFT ("Prefill-only Fine-Tuning", Stanford, May 2026) is a multi-LoRA serving technique that applies the LoRA (or ReFT) adapter only during prefill and discards it before decode begins. The adapter shapes the KV cache during prefill; every subsequent decode step then runs the bare base model and reads from that KV cache, so the adapter's behavioural contribution is "remembered" without re-applying the adapter on every step. The authors release implementations for LoRA and ReFT on vLLM and measure 1.9× throughput on Llama 3.1 70B serving 512 concurrent adapters.

Why does PreFT speed up multi-LoRA serving?

Multi-LoRA serving is bottlenecked by decode, not prefill — decode emits hundreds of tokens per request and each step has to fold in the per-request adapter delta (B·A) on every layer. With 512 concurrent adapters in a batch, the engine dispatches a different B·A per row on every decode step (SGMV). PreFT zeroes out that cost: prefill still does its adapter work (and runs once per request, in a parallel pass over the prompt), but every decode step skips the adapter entirely and runs the base model. The savings concentrate exactly where the work is — sequential, memory-bandwidth-bound decode.

How does PreFT relate to prefill/decode disaggregation?

They target different aspects of the same prefill/decode split. Prefill/decode disaggregation runs the two phases on physically separate GPU pools to stop long prefills from blocking low-latency decodes, paying a cross-machine KV transfer to do so. PreFT keeps both phases on the same engine but changes what the adapter does in each: prefill applies it, decode skips it. The two are complementary — a server can disaggregate prefill from decode AND apply PreFT, with the decode pool simply not loading adapter weights at all.

PreFT applies LoRA only to prefill — Prefill-only LoRA adapters

learnaivisually.com/ai-explained/preft-prefill-only-adapters

TL;DR

What is it: The PreFT paper from Stanford (Lanpouthakoun et al.) applies the LoRA adapter only during prefill and then drops it before decode begins. The adapter's behavioural signal is baked into the prefilled KV cache, so the rest of the generation runs the bare base model.
Why it’s needed: Multi-LoRA serving is bottlenecked by decode, not prefill — decode emits hundreds to thousands of tokens per request, each one a forward pass that has to fold in adapter weights. Stripping the adapter from decode collapses that per-step cost across the whole serving fleet.
vs previous: Standard multi-LoRA (SGMV / unified paging) keeps the adapter active for every token, prefill and decode alike — the adapter delta is recomputed on every decode step. PreFT cuts that to a one-time cost during prefill.

Jargon

LoRA: Low-Rank Adaptation. Instead of fine-tuning the base weights, train two small matrices A (rank-r down-projection) and B (rank-r up-projection) whose product B·A adds a low-rank delta to each frozen weight matrix. The full picture lives in LLM Serving → Multi-LoRA → LoRA math.
Multi-LoRA serving: Hosting many LoRA adapters on top of a single shared base model so each request can pick which adapter to apply at runtime. Engines like vLLM batch requests across adapters in one forward pass.
Prefill: The first forward pass over the input prompt, which produces the initial KV cache. Compute-bound and parallel across prompt tokens. See KV Cache → Prefill.
Decode: The token-at-a-time phase that emits each new output token after prefill. Memory-bandwidth-bound and sequential. Each decode step is a single forward pass over the latest token.
KV cache: The keys and values cached from every prior token so attention does not recompute them every step. Anything that shaped those KV pairs during prefill is “remembered” by every subsequent decode step. See KV Cache → Intro.
SGMV: Segmented Gather Matrix-Vector — the kernel used by Punica / vLLM to apply one of many LoRA adapters to each request in a mixed batch. The cost is paid on every token whose request uses an adapter — including every decode step.
ReFT: Representation Fine-Tuning — an alternative to LoRA that edits a model's hidden representations directly via small additive interventions. PreFT's prefill-only trick is shown to work for ReFT too, not just LoRA.
Adapter delta: The additive contribution an adapter makes to a layer's output. For LoRA this is B·A·x; for ReFT it is the intervention applied to the hidden representation. Cheap per token in raw FLOPs, but multiplied by the number of decode tokens and the number of concurrent adapters in the batch, it becomes the serving bottleneck.
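
To make the LoRA and adapter-delta entries concrete, here is a minimal NumPy sketch of a single adapted linear layer. The sizes, the rank, and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes and LoRA rank, chosen only for illustration.
d_in, d_out, r = 64, 64, 8

W = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01     # down-projection: d_in -> r
B = rng.standard_normal((d_out, r)) * 0.01    # up-projection:   r -> d_out

def base_forward(x):
    return W @ x

def lora_forward(x):
    # Apply A first, then B: two thin matmuls instead of materializing B @ A.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
delta = lora_forward(x) - base_forward(x)     # the adapter delta B·A·x

# Mathematically the same as merging the delta into the weight.
merged = (W + B @ A) @ x

# The adapter stores r * (d_in + d_out) numbers, versus d_in * d_out for W.
adapter_params = A.size + B.size
```

The unmerged form is the one that matters for multi-LoRA serving: with many adapters sharing one base model, the engine cannot merge B·A into W, so it pays the two thin matmuls wherever the adapter is active.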

The news. On May 16, 2026, Stanford researchers (Lanpouthakoun, Arora, Wu, Pai, Keigwin, Jurafsky, Potts) shared PreFT — the paper had been posted to arXiv two days earlier — alongside reference implementations for LoRA and ReFT on vLLM. The headline result is 1.9× throughput on Llama 3.1 70B serving 512 concurrent adapters, with RL-trained adapters approaching parity and supervised-fine-tuning adapters needing a higher rank to close the gap. The authors' quote on the method is direct: “We only apply the adapter to prefill tokens and discard it afterwards.”

Picture the museum at opening time. A tour guide stands at the entrance and briefs each arriving group — “notice the Vermeer on the right, the Rothko on the second floor.” Once the briefing lands, the guide steps off the floor. The visitors carry the briefing in their heads as they wander the galleries one room at a time. The alternative is the guide walking each visitor through every single room, repeating the same talking points hundreds of times — the rooms become a bottleneck not because they are slow but because the guide is overloaded.

A standard multi-LoRA server is the second museum. For every request — and every decode step inside that request — the engine looks up the request's adapter, folds the low-rank B·A delta into each layer's output, and only then emits the next token. With 512 concurrent adapters the kernel has to dispatch a different B·A per row of the batch (SGMV) on every step. That cost is cheap in the abstract — a low-rank matmul is small compared to the base model — but multiplied by hundreds of decode steps per request times 512 concurrent rows, the adapter overhead eats the serving budget.
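
The per-row dispatch described above can be sketched in plain NumPy. This is not the SGMV kernel itself, only a gather-then-multiply stand-in that shows the shape of the work; the adapter count, sizes, and function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 adapters, hidden size 32, rank 4, 6 requests in the batch.
n_adapters, d, r, batch = 4, 32, 4, 6

W = rng.standard_normal((d, d))                    # shared frozen base weight
A = rng.standard_normal((n_adapters, r, d)) * 0.01 # one down-projection per adapter
B = rng.standard_normal((n_adapters, d, r)) * 0.01 # one up-projection per adapter

adapter_ids = np.array([0, 3, 1, 3, 2, 0])         # which adapter each row uses
X = rng.standard_normal((batch, d))                # one hidden vector per request

def decode_step_multi_lora(X, adapter_ids):
    """Standard regime: every decode step gathers a different B·A per row."""
    base = X @ W.T                                   # shared across the batch
    A_rows = A[adapter_ids]                          # (batch, r, d) gather
    B_rows = B[adapter_ids]                          # (batch, d, r) gather
    low = np.einsum("brd,bd->br", A_rows, X)         # per-row A·x
    delta = np.einsum("bdr,br->bd", B_rows, low)     # per-row B·(A·x)
    return base + delta

def decode_step_base(X):
    """Adapter-free step: one shared matmul, no per-row gather."""
    return X @ W.T

out_standard = decode_step_multi_lora(X, adapter_ids)
out_base = decode_step_base(X)
```

The gather and the two per-row contractions are the overhead that repeats on every decode step in the standard regime; the base-only step is what remains once the adapter is out of the decode path.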

PreFT is the first museum. During prefill, the adapter does its full job: each layer's activations get the B·A delta, so the KV cache produced during prefill is already shaped by the adapter. At the prefill / decode boundary, the engine throws the adapter away. Every subsequent decode step is just the unmodified base model attending to KV pairs the adapter already “briefed”. The behavioural fingerprint of the fine-tune is encoded into the cached keys and values that every later decode step reads from anyway — the model never has to recompute the adapter's contribution because it's already baked into what the cache remembers.
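
A toy single-head attention layer shows the mechanism in miniature. This is a sketch under stated assumptions, not the authors' implementation: LoRA is attached only to the key and value projections, there is one layer, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, P = 16, 2, 5  # hidden size, LoRA rank, prompt length (hypothetical)

# Frozen base projections for one attention head.
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

# LoRA pairs (B: d x r, A: r x d) for the key and value projections.
Bk, Ak = rng.standard_normal((d, r)) * 0.3, rng.standard_normal((r, d)) * 0.3
Bv, Av = rng.standard_normal((d, r)) * 0.3, rng.standard_normal((r, d)) * 0.3

def project(X, W, lora=None):
    """Rows of X times W^T, plus the low-rank delta when a LoRA pair is given."""
    out = X @ W.T
    if lora is not None:
        B, A = lora
        out = out + (X @ A.T) @ B.T
    return out

def prefill(prompt, use_adapter):
    """One parallel pass over the prompt; returns the KV cache."""
    K = project(prompt, Wk, (Bk, Ak) if use_adapter else None)
    V = project(prompt, Wv, (Bv, Av) if use_adapter else None)
    return K, V

def decode_step(x, K, V):
    """Base weights only: append this token's K/V, then attend over the cache."""
    q = project(x[None, :], Wq)[0]
    K = np.vstack([K, project(x[None, :], Wk)])
    V = np.vstack([V, project(x[None, :], Wv)])
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V, K, V

prompt = rng.standard_normal((P, d))
x = rng.standard_normal(d)

K_adapted, V_adapted = prefill(prompt, use_adapter=True)   # PreFT-style prefill
K_plain, V_plain = prefill(prompt, use_adapter=False)      # base-only prefill

out_preft, K1, V1 = decode_step(x, K_adapted, V_adapted)
out_base, _, _ = decode_step(x, K_plain, V_plain)
```

Both decode calls run exactly the same adapter-free code, yet `out_preft` differs from `out_base` because the cached keys and values it attends to were produced with the adapter active. In a real multi-layer model the adapter also shapes the hidden states feeding higher layers during prefill, which this one-layer toy does not capture.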

The hero animation shows this side by side. Both lanes start identically: input prompt enters, the model runs prefill with the 8 visible adapter chips active (representing 8 of 512 in the batch), and the KV cache fills up with blue bars over the prefill window. At the dashed boundary, the adapter chips on the PreFT lane fade out and the boundary line glows green. From there on, both lanes emit output tokens — but the PreFT lane emits roughly two for every one the Standard lane produces in the same elapsed time, because the Standard lane is still paying the adapter cost on every decode step. The throughput meters at the right of each lane settle at the 1.9× ratio the paper reports.

A worked example with named numbers shows where it earns its keep (illustrative; exact numbers depend on adapter rank, batch composition, and the engine). Picture a serving slice with N=512 active adapters in the batch, a prompt of P=1,024 prefill tokens, and a generation length of D=512 decode tokens. In the standard regime the adapter does its low-rank work at P+D = 1,536 token positions per request, on every layer. PreFT keeps the adapter on the P=1,024 prefill positions but removes it from the D=512 decode positions. Those decode positions are the expensive ones: each is its own sequential forward pass, whereas prefill fuses all P prompt tokens into roughly one parallel pass. On Llama 3.1 70B with 512 concurrent adapters, the paper reports 1.9× throughput vs. the standard multi-LoRA baseline.
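
The bookkeeping in this example can be written out as a few lines of Python. The function name and the "one parallel pass for prefill" simplification (no chunked prefill) are assumptions; the sketch counts adapter applications only and does not attempt to reproduce the paper's 1.9× measurement.

```python
def adapter_work(P, D, preft):
    """Count where the adapter runs for one request, per adapted layer.

    Assumes prefill is a single parallel pass over all P prompt tokens
    and each decode token is its own sequential pass.
    """
    decode_positions = 0 if preft else D
    return {
        # Token positions at which the low-rank delta is computed.
        "token_positions": P + decode_positions,
        # Forward passes that have to touch adapter weights at all.
        "passes_with_adapter": 1 + decode_positions,
    }

P, D = 1024, 512
standard = adapter_work(P, D, preft=False)
preft = adapter_work(P, D, preft=True)

print("standard:", standard)
print("PreFT:   ", preft)
```

By token positions the adapter work drops by a third (1,536 to 1,024), but by sequential passes it drops from 513 to 1, which is the count that matters when each decode pass is a separate memory-bandwidth-bound step.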

Where PreFT sits next to other adapter-cost levers

Lever	What it changes	When the adapter runs	Where it earns its keep
Vanilla multi-LoRA	Baseline — every token on every adapter	Prefill and every decode step	simplest path; flexibility to swap adapter mid-stream
SGMV-batched decode	Kernel-level efficiency, not adapter-call elimination	Prefill and every decode step (just batched)	flattens the per-row dispatch cost across a mixed batch
Unified adapter paging	Adapter weight residency, not active-step cost	Prefill and every decode step (paged in)	fits more adapters in HBM at once
PreFT — prefill-only adapter discard (this paper)	Decode-side adapter cost goes to zero	Prefill only — adapter discarded before decode	multi-LoRA workloads dominated by decode; 1.9× throughput at N=512, Llama 3.1 70B

PreFT slots into the existing multi-LoRA serving stack — the adapter still trains and serves as a normal LoRA or ReFT module, the engine just stops applying it once prefill emits its last KV pair. Because the adapter's contribution lives in the KV cache, in principle the standard cache-side levers — paged attention, prefix caching, chunked prefill — should remain compatible, though the paper itself focuses on the LoRA / ReFT + vLLM result and does not benchmark each combination. There is one new constraint either way: if a request needs the adapter to behave differently mid-stream (e.g. a long-running agent that switches persona at decode step 200), PreFT doesn't support it without a re-prefill, because the adapter is no longer in the loop.
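
The mid-stream constraint can be illustrated with a small bookkeeping sketch. The `Request` class and its methods are hypothetical, and the assumption that a re-prefill covers the prompt plus everything generated so far is mine, not something the paper specifies.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    adapter: str
    generated: int = 0             # decode tokens emitted so far
    cache_len: int = 0             # token positions currently in the KV cache
    prefill_tokens_spent: int = 0  # total positions processed by prefill passes

    def prefill(self):
        # The adapter is applied here only; the cache now reflects self.adapter.
        context = self.prompt_len + self.generated
        self.cache_len = context
        self.prefill_tokens_spent += context

    def decode(self, steps):
        # Base model only: no adapter lookup on this path.
        self.generated += steps
        self.cache_len += steps

    def switch_adapter(self, new_adapter):
        # The old cache was shaped by the old adapter, so it is discarded and
        # the whole context is prefilled again under the new one (assumption).
        self.adapter = new_adapter
        self.cache_len = 0
        self.prefill()

req = Request(prompt_len=1024, adapter="persona_a")
req.prefill()
req.decode(200)                   # 200 adapter-free decode steps
req.switch_adapter("persona_b")   # forces a second prefill over 1,224 positions
```

Under these assumptions the switch at decode step 200 costs a second prefill over 1,224 positions, which is the price of having no adapter in the decode loop to swap.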

The mental-model shift is that adapter cost is front-loaded into the cache, not paid on every decode step. The paper demonstrates this for LoRA and ReFT specifically; whether other adapter families can be “projected through” the prefilled KV cache the same way is an open question. Multi-LoRA overhead piled up in decode for the same reason most serving overhead does: decode is the long, sequential phase, so any per-step cost is paid hundreds of times per request. PreFT removes the adapter from that phase entirely.

Goes deeper in: LLM Serving → Multi-LoRA Serving → SGMV-batched decode

Related explainers

HuggingFace — Async continuous batching — another “the GPU is idle on decode-side bookkeeping” bottleneck, attacked at the scheduler instead of the adapter
vLLM v0.20 — FlashAttention 4 packing — a decode-side throughput win in the same engine PreFT releases on
PPOW — window-level RL for speculative drafters — a different decode-side throughput lever: vary how many tokens you draft per step, instead of varying whether the adapter runs
