PEFT scaling — a million personal adapters on one frozen base

[Figure: One frozen base, a million personal adapters, not a million model copies. A frozen shared base (~1T params, trained once, never copied per user) sits under one tiny personal adapter per user; adapters are either resident (hot) or paged (cold). Illustrative storage to personalize 1,000,000 users: a full fine-tune per user ≈ 2 EB (infeasible); one adapter per user ≈ 20 TB, about 100,000× smaller. MinT covers identity, revision, provenance, and serving residency.]
learnaivisually.com/ai-explained/peft-scaling-persistent-personal-adapters

The news. On June 1, 2026, a position-and-systems paper on PEFT scaling (arXiv:2606.02437) argued that parameter-efficient fine-tuning should be treated not as a budget substitute for full fine-tuning but as a substrate for persistent personal models: small trainable adapters that store per-user preferences, skills, tool habits, and memory-like updates, while one shared foundation model supplies general competence. It organizes the problem along three axes: scale up (a stronger shared base), scale down (how minimal an adapter can be), and scale out (millions of persistent instances coexisting). It also proposes MinT, an infrastructure design for managing adapter identity, revision, provenance, and serving residency at the scale of ~1,000,000 adapters on a ~1-trillion-parameter base.

Picture a public library with one giant shared encyclopedia — printed once, kept on a reading stand, never rewritten. Every reader who walks in gets a thin personal sticky-note booklet: their own annotations, corrections, and dog-eared shortcuts, clipped onto the same shared volume. The library does not reprint the encyclopedia for each reader — that would be absurd. It keeps the one book and a drawer of a million thin booklets. That is precisely the arrangement this paper argues for: a single frozen shared base plus a personal adapter per user, where the base is the encyclopedia and your adapter is the booklet that makes the answers yours.

LoRA is usually pitched as a cost-cutting trick: a cheaper way to fine-tune when you can't afford to touch all the weights, born from the same pressure that drives the cost crisis behind serving many adapters. This paper flips the framing. The adapter is not a discount fine-tune you train and discard; it is persistent personal state: preferences, learned skills, and memory-like updates that live on top of the base indefinitely. The reframe matters because it changes what the adapter is for. A discount fine-tune is disposable; personal state is permanent, which means it needs an identity, a version history, and a place to live when it isn't being used.

The three scaling axes

| Axis | What scales | The lever |
| --- | --- | --- |
| Scale up | The shared base | A stronger ~1T base means each small adapter delta does more work, so less must be learned per user |
| Scale down | The adapter | How minimal an adapter can be while staying reliable: the booklet, not the encyclopedia (see rank vs. capacity) |
| Scale out | The instance count | ~1,000,000 persistent adapters coexisting (the paper's stated target), managed by MinT infrastructure |

The three axes are the paper's own organizing frame; the ~1T base and ~1,000,000-adapter figures are its stated targets, not measured deployments.

At a million coexisting adapters, the hard problems stop being about training and start being about bookkeeping. Which booklet is yours? Which base version was it clipped onto? When you revise it, what was the old version, and who is allowed to read it? And — the serving question — is your booklet on the reading desk right now, or is it filed in the back room? The paper bundles these under MinT: identity, revision, provenance, evaluation, and serving residency. Residency is the sharpest of these, because most of a million adapters are cold at any instant and must be paged into GPU memory on demand — exactly the question of where each adapter physically lives at serving time.
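The bookkeeping and the residency question can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the paper's design: the `AdapterRecord` fields, the `ResidencyCache` class, and the least-recently-used eviction policy are hypothetical names and choices, picked only to show what "hot" versus "cold" means under a fixed memory budget.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterRecord:
    """Hypothetical per-adapter metadata (field names are assumptions)."""
    adapter_id: str    # identity: whose booklet this is
    revision: int      # revision: which version of the booklet
    base_version: str  # provenance: which base it was trained against
    size_mb: int       # footprint when loaded


class ResidencyCache:
    """Tracks which adapters are hot under a fixed memory budget.

    Uses LRU eviction as a stand-in policy; the source does not say
    which policy MinT uses.
    """

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        # adapter_id -> record, ordered least recently used first
        self._hot = OrderedDict()

    def is_resident(self, adapter_id: str) -> bool:
        return adapter_id in self._hot

    def touch(self, record: AdapterRecord) -> list:
        """Record a request for this adapter, paging it in if it is cold.

        Returns the ids of adapters evicted to make room.
        Revision changes are not handled here: a revised adapter would
        need its old resident copy invalidated first.
        """
        if record.adapter_id in self._hot:
            self._hot.move_to_end(record.adapter_id)  # now most recent
            return []
        if record.size_mb > self.capacity_mb:
            raise ValueError("adapter is larger than the hot budget")
        evicted = []
        while self.used_mb + record.size_mb > self.capacity_mb:
            old_id, old = self._hot.popitem(last=False)  # drop the LRU entry
            self.used_mb -= old.size_mb
            evicted.append(old_id)
        self._hot[record.adapter_id] = record
        self.used_mb += record.size_mb
        return evicted


# Usage: a 40 MB hot budget and three 20 MB adapters.
a = AdapterRecord("user-a", 1, "base-v1", 20)
b = AdapterRecord("user-b", 1, "base-v1", 20)
c = AdapterRecord("user-c", 1, "base-v1", 20)

cache = ResidencyCache(capacity_mb=40)
cache.touch(a)
cache.touch(b)
cache.touch(a)                    # user-a is used again, so user-b is now the LRU
evicted_for_c = cache.touch(c)    # paging in user-c pushes user-b out
```

In this toy run the budget holds two adapters, so requesting a third forces the least recently used one back to cold storage. A real serving system would also have to account for transfer latency and in-flight requests, which this sketch ignores.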

The reason this is even thinkable is that an adapter is small in a way that a full copy never is:

[Figure: size comparison, both bars drawn at true 1:280 scale. Full 7B fine-tune: 14 GB (14,000 MB). LoRA adapter (r=16): 50 MB, shown as a tiny sliver. The adapter is 280× smaller for the same behavior change.]
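Where the smallness comes from can be sketched with the standard LoRA parameter count: for a weight matrix of shape d_out × d_in, LoRA trains two low-rank factors totaling r · (d_in + d_out) parameters instead of d_in · d_out. The model shape below (hidden size, layer count, which matrices are adapted) is a hypothetical round-number assumption and is not meant to reproduce the 50 MB figure above, which depends on which modules are adapted and at what precision.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns a low-rank update B @ A, with B of shape d_out x rank
    and A of shape rank x d_in.
    """
    return rank * (d_in + d_out)


# Hypothetical model shape (assumptions chosen for round numbers).
HIDDEN = 4096
LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 4  # assumed: four square attention projections
RANK = 16
BYTES_PER_PARAM = 2             # 16-bit weights

per_matrix = lora_params(HIDDEN, HIDDEN, RANK)
adapter_params = LAYERS * ADAPTED_MATRICES_PER_LAYER * per_matrix
adapter_mb = adapter_params * BYTES_PER_PARAM / 1e6

# How much smaller the low-rank factors are than the matrix they adapt.
shrink_per_matrix = (HIDDEN * HIDDEN) // per_matrix

print(f"adapter params:    {adapter_params:,}")
print(f"adapter size:      {adapter_mb:.1f} MB")
print(f"per-matrix shrink: {shrink_per_matrix}x")
```

Under these assumptions the adapter lands in the tens of megabytes, the same order of magnitude as the figure, and its size grows linearly with the rank and with the number of adapted matrices.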

Walk the storage budget with round numbers (illustrative, decimal units). Suppose you wanted to personalize 1,000,000 users the old way, with a full fine-tune per user. A ~1T-parameter base at 2 bytes per parameter is about 2 TB per copy, so a million copies is ~2,000,000 TB ≈ 2 EB, the kind of number that ends the conversation. Now do it with adapters: a single personal adapter of, say, ~20 MB times a million users is ~20,000,000 MB ≈ 20 TB. Same fleet of personalized users, 2 EB → 20 TB: roughly 100,000× less storage. Under this fixed-adapter assumption the gap widens as the base grows, since the encyclopedia gets bigger while the booklet stays about the same size. The "scale up" and "scale down" axes pull in the same direction, which is why a million personal models is an infrastructure problem rather than a fantasy. The same shrink-the-numbers instinct shows up elsewhere in serving, including the size problem that motivates quantization.
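The arithmetic above can be checked in a few lines. This is a sketch using the article's illustrative inputs (1,000,000 users, a ~1T-parameter base at 2 bytes per parameter, ~20 MB per adapter) and decimal units; the constant names are just labels for those assumptions.

```python
# Illustrative inputs taken from the walk-through above.
USERS = 1_000_000
BASE_PARAMS = 1_000_000_000_000   # ~1T parameters
BYTES_PER_PARAM = 2               # 16-bit weights
ADAPTER_BYTES = 20 * 10**6        # ~20 MB per personal adapter

# Decimal storage units.
TB = 10**12
EB = 10**18

base_copy_bytes = BASE_PARAMS * BYTES_PER_PARAM   # one full copy of the base
full_copies_bytes = USERS * base_copy_bytes       # a full fine-tune per user
adapters_bytes = USERS * ADAPTER_BYTES            # one adapter per user

# Adapters only; the single shared base adds one more base_copy_bytes on top.
ratio = full_copies_bytes // adapters_bytes

print(f"one base copy:        {base_copy_bytes / TB:.0f} TB")
print(f"full copies per user: {full_copies_bytes / EB:.0f} EB")
print(f"adapters per user:    {adapters_bytes / TB:.0f} TB")
print(f"storage ratio:        {ratio:,}x")
```

The ratio simplifies to the size of one base copy divided by the size of one adapter, so it does not depend on the user count. That is the sense in which the gap widens as the base grows while the adapter stays about the same size.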

Goes deeper in: LLM Serving → Multi-LoRA Serving → Unified Paging: Where Does Your Adapter Live?
