PEFT scaling — a million personal adapters on one frozen base
The news. On June 1, 2026, a position-and-systems paper on PEFT scaling (arXiv:2606.02437) argued that parameter-efficient fine-tuning should be treated not as a budget substitute for full fine-tuning but as a substrate for persistent personal models: small trainable adapters that store per-user preferences, skills, tool habits, and memory-like updates, while one shared foundation model supplies general competence. It organizes the problem along three axes: scale up (a stronger shared base), scale down (how minimal an adapter can be), and scale out (millions of persistent instances coexisting). It also proposes MinT, an infrastructure design for managing adapter identity, revision, provenance, and serving residency at the scale of ~1,000,000 adapters on a ~1-trillion-parameter base.
Picture a public library with one giant shared encyclopedia — printed once, kept on a reading stand, never rewritten. Every reader who walks in gets a thin personal sticky-note booklet: their own annotations, corrections, and dog-eared shortcuts, clipped onto the same shared volume. The library does not reprint the encyclopedia for each reader — that would be absurd. It keeps the one book and a drawer of a million thin booklets. That is precisely the arrangement this paper argues for: a single frozen shared base plus a personal adapter per user, where the base is the encyclopedia and your adapter is the booklet that makes the answers yours.
LoRA is usually pitched as a cost-cutting trick: a cheaper way to fine-tune when you can't afford to touch all the weights, born from the same pressure that drives the cost crisis behind serving many adapters. This paper flips the framing. The adapter is not a discount fine-tune that you train and discard; it is persistent personal state (preferences, learned skills, and memory-like updates) that lives on top of the base indefinitely. The reframe matters because it changes what the adapter is for. A discount fine-tune is disposable; personal state is meant to last, which means it needs an identity, a version history, and a place to live when it isn't being used.
The three scaling axes
| Axis | What scales | The lever |
|---|---|---|
| Scale up | The shared base | A stronger ~1T base means each small adapter delta does more work, so less must be learned per user |
| Scale down | The adapter | How minimal an adapter can be while staying reliable — the booklet, not the encyclopedia (see rank vs. capacity) |
| Scale out | The instance count | ~1,000,000 persistent adapters coexisting (the paper's stated target), managed by MinT infrastructure |
The three axes are the paper's own organizing frame; the ~1T base and ~1,000,000-adapter figures are its stated targets, not measured deployments.
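To make the "scale up" and "scale down" rows concrete, here is a minimal sketch of how a LoRA-style adapter's parameter count compares with the weight matrix it adapts. It assumes the standard LoRA factorization (two low-rank matrices of rank `r` per adapted matrix). The layer widths and the rank are hypothetical round numbers chosen for illustration, not figures from the paper.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix of the frozen base."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix.

    Standard LoRA adds B @ A, with A of shape (rank, d_in) and
    B of shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


RANK = 8  # hypothetical adapter rank

for width in (8192, 16384):  # hypothetical layer widths
    full = full_params(width, width)
    adapter = lora_params(width, width, RANK)
    print(f"width={width}: full={full:,} adapter={adapter:,} ratio={full // adapter}x")

# width=8192: full=67,108,864 adapter=131,072 ratio=512x
# width=16384: full=268,435,456 adapter=262,144 ratio=1024x
```

Even when the adapter is allowed to grow with layer width, as in this sketch, it grows linearly while the full matrix grows quadratically, so the per-matrix gap doubles when the width doubles.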
At a million coexisting adapters, the hard problems stop being about training and start being about bookkeeping. Which booklet is yours? Which base version was it clipped onto? When you revise it, what was the old version, and who is allowed to read it? And — the serving question — is your booklet on the reading desk right now, or is it filed in the back room? The paper bundles these under MinT: identity, revision, provenance, evaluation, and serving residency. Residency is the sharpest of these, because most of a million adapters are cold at any instant and must be paged into GPU memory on demand — exactly the question of where each adapter physically lives at serving time.
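The bookkeeping questions above can be sketched as a small data structure. This is an illustrative toy, not MinT's actual design: the `AdapterRecord` fields, the `ResidencyCache` class, and the least-recently-used eviction policy are all assumptions made here to show what identity, revision, provenance, and residency might look like as code.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdapterRecord:
    """Hypothetical metadata for one personal adapter."""
    adapter_id: str                  # identity: whose booklet this is
    revision: int                    # revision: which version of it
    base_version: str                # provenance: which base it was clipped onto
    parent_revision: Optional[int]   # provenance: the revision it replaced, if any
    size_bytes: int


class ResidencyCache:
    """LRU set of adapters held in a fixed byte budget (a stand-in for GPU memory)."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._hot = OrderedDict()  # (adapter_id, revision) -> AdapterRecord

    def touch(self, record: AdapterRecord) -> bool:
        """Request an adapter. Returns True on a hit, False if it was paged in."""
        key = (record.adapter_id, record.revision)
        if key in self._hot:
            self._hot.move_to_end(key)  # mark as most recently used
            return True
        if record.size_bytes > self.budget_bytes:
            raise ValueError("adapter is larger than the whole budget")
        # Evict the coldest adapters until the new one fits.
        while self.used_bytes + record.size_bytes > self.budget_bytes:
            _, evicted = self._hot.popitem(last=False)
            self.used_bytes -= evicted.size_bytes
        self._hot[key] = record
        self.used_bytes += record.size_bytes
        return False

    def resident_ids(self) -> list:
        """Adapter ids currently resident, coldest first."""
        return [adapter_id for adapter_id, _ in self._hot]


# Budget with room for two 20 MB adapters (illustrative numbers).
cache = ResidencyCache(budget_bytes=40_000_000)
alice = AdapterRecord("alice", 3, "base-v1", 2, 20_000_000)
bob = AdapterRecord("bob", 1, "base-v1", None, 20_000_000)
carol = AdapterRecord("carol", 1, "base-v1", None, 20_000_000)

hits = [cache.touch(r) for r in (alice, bob, alice, carol)]
print(hits)                  # [False, False, True, False]
print(cache.resident_ids())  # ['alice', 'carol']
```

In this toy run, requesting `carol` evicts `bob` rather than `alice`, because `alice` was used more recently. A real system would also need to decide what happens when the base version changes underneath an existing adapter, which this sketch only records and does not enforce.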
The reason this is even thinkable is that an adapter is small in a way that a full copy never is:
(Figure callout: 280× smaller, same behavior change.)
Walk the storage budget with round numbers (illustrative). Suppose you wanted to personalize 1,000,000 users the old way, with a full fine-tune per user. A ~1T-parameter base at 2 bytes per parameter is about 2 TB per copy, so a million copies is ~2,000,000 TB ≈ 2 EB, the kind of number that ends the conversation. Now do it with adapters: a single personal adapter of, say, ~20 MB times a million users is ~20,000,000 MB ≈ 20 TB, plus one 2 TB copy of the shared base. Same fleet of personalized users, 2 EB → roughly 20 TB: about 100,000× less storage for the adapters themselves, and still about 90,000× less once the shared base is counted. Under this fixed-adapter assumption the gap widens as the base grows, since the encyclopedia gets bigger while the booklet stays about the same size. The "scale up" and "scale down" axes pull in the same direction, which is why a million personal models is an infrastructure problem rather than a fantasy. The same shrink-the-numbers instinct shows up elsewhere in serving, and in the size problem that motivates quantization.
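The arithmetic above can be checked in a few lines. This sketch uses the same illustrative round numbers as the text and decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes); the variable names are made up for the example.

```python
TB = 10**12
EB = 10**18

BASE_PARAMS = 1_000_000_000_000  # ~1T-parameter base
BYTES_PER_PARAM = 2              # 16-bit weights
USERS = 1_000_000
ADAPTER_BYTES = 20_000_000       # ~20 MB per personal adapter (illustrative)

base_bytes = BASE_PARAMS * BYTES_PER_PARAM   # one copy of the base
full_copies_bytes = base_bytes * USERS       # a full fine-tune per user
adapters_bytes = ADAPTER_BYTES * USERS       # one adapter per user
adapters_plus_base = adapters_bytes + base_bytes

print(base_bytes / TB)                          # 2.0 (TB per base copy)
print(full_copies_bytes / EB)                   # 2.0 (EB for a million copies)
print(adapters_bytes / TB)                      # 20.0 (TB for a million adapters)
print(full_copies_bytes // adapters_bytes)      # 100000
print(full_copies_bytes // adapters_plus_base)  # 90909
```

The last line counts the one shared copy of the base alongside the adapters, which lowers the ratio slightly but leaves it at the same order of magnitude.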
Related explainers
- PreFT — prefill-only LoRA adapters — where in the forward pass an adapter applies, the serving-cost flip side of where it lives.
- Parametric Memory Law — LoRA capacity vs. verbatim recall — how much a small adapter can actually store, which sets the "scale down" floor.