What is PEFT scaling about?

It is a position-and-systems paper that reframes parameter-efficient fine-tuning (PEFT), such as LoRA, from a cost-cutting substitute for full fine-tuning into a substrate for persistent personal models. A small adapter stores a user's preferences, skills, and memory-like updates while one shared frozen base supplies general competence. It organizes the idea along three axes — scale up (the base), scale down (the adapter), and scale out (the instance count) — and proposes MinT, infrastructure to manage roughly 1,000,000 personal adapters on a ~1-trillion-parameter base.

Why does treating adapters as persistent personal state matter?

Because it changes the unit of personalization from a handful of disposable adapters to millions of durable ones. A persistent adapter needs an identity, a revision history, provenance, and a place to live when idle, concerns that don't arise when an adapter is a throwaway fine-tune. The reframing points to a future where every user gets their own model behavior without anyone storing or serving a separate full model per user: the ~1T base is never copied, only a tiny adapter is.

How is this different from just using LoRA to fine-tune cheaply?

Cheap LoRA fine-tuning trains an adapter for a task and is done. This paper keeps the adapter alive as personal state and asks the systems question that follows: how do you store, version, audit, and serve a million of them at once? That is the role of MinT, and serving residency — paging cold adapters into GPU memory on demand — is its sharpest constraint, because most of a million adapters are inactive at any given moment.

PEFT scaling paper — Persistent personal adapters at million-scale

PEFT scaling — a million personal adapters on one frozen base


learnaivisually.com/ai-explained/peft-scaling-persistent-personal-adapters

TL;DR

What is it: A position-and-systems paper on PEFT scaling reframes a parameter-efficient fine-tuning (LoRA-style) adapter as persistent per-user state, and sketches MinT — infrastructure to serve roughly 1,000,000 personal adapters over one frozen ~1-trillion-parameter base.
Why it’s needed: It moves the unit of personalization from a handful of throwaway adapters to millions of durable ones, each with identity, revision, provenance, and serving residency — a concrete future-state for multi-tenant LLM serving where every user gets their own model behavior without their own model.
vs previous: Where PEFT is usually sold as a cheap substitute for full fine-tuning, this treats the adapter as permanent personal memory on a shared base: the ~1T base is never copied per user — only a tiny adapter is stored, versioned, and paged in.

Jargon

PEFT: Parameter-Efficient Fine-Tuning. Adapt a model by training a small set of new weights while the original weights stay frozen, instead of updating all of them.
LoRA: Low-Rank Adaptation, the most common PEFT method: it learns a tiny low-rank "delta" matrix per layer that nudges the frozen weights.
Adapter: The small trainable delta a PEFT method produces. It is typically tens of megabytes, versus gigabytes to terabytes for the base (about 14 GB for a 7B model and about 2 TB for a ~1T model at 2 bytes per parameter). It is added on top of the frozen weights at inference time.
Full fine-tuning: The classic approach: update every weight in the model for the new task. It produces a complete model copy per use case, which does not scale to millions of users.
MinT: The paper's proposed infrastructure layer for managing adapter identity, revision, provenance, evaluation, and serving residency at the scale of millions of coexisting adapters.
Serving residency: Where an adapter physically lives at request time, either resident in GPU memory (fast) or paged out to slower storage (cheap). With millions of adapters, most are cold and must be paged in on demand.
Provenance: The recorded history of an adapter — what data trained it, which base version it targets, and how it has been revised — so a personal adapter can be audited and trusted over time.
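
To make the LoRA and Adapter entries concrete, here is a minimal sketch of why the delta is small. For a frozen weight matrix of shape d_out × d_in, a rank-r update trains two factors, B (d_out × r) and A (r × d_in), so it adds r·(d_in + d_out) parameters instead of d_in·d_out. The 4096 dimensions, the rank of 16, and the function names are illustrative assumptions, not figures or code from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, scale: float = 1.0) -> Vector:
    """Frozen path W x plus the scaled low-rank correction B (A x)."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]


# Illustrative (assumed) layer size: one 4096 x 4096 projection at rank 16.
full = full_params(4096, 4096)
small = lora_params(4096, 4096, 16)
reduction = full // small
```

The frozen matrix `w` is never modified in `lora_forward`; only `a` and `b` would differ between users, which is what makes one shared base plus many small deltas possible.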

The news. On June 1, 2026, a position-and-systems paper on PEFT scaling (arXiv:2606.02437) argued that parameter-efficient fine-tuning should be treated not as a budget substitute for full fine-tuning but as a substrate for persistent personal models: small trainable adapters that store per-user preferences, skills, tool habits, and memory-like updates, while one shared foundation model supplies general competence. It organizes the problem along three axes: scale up (a stronger shared base), scale down (how minimal an adapter can be), and scale out (millions of persistent instances coexisting). It also proposes MinT, an infrastructure design for managing adapter identity, revision, provenance, and serving residency at the scale of ~1,000,000 adapters on a ~1-trillion-parameter base.

Picture a public library with one giant shared encyclopedia — printed once, kept on a reading stand, never rewritten. Every reader who walks in gets a thin personal sticky-note booklet: their own annotations, corrections, and dog-eared shortcuts, clipped onto the same shared volume. The library does not reprint the encyclopedia for each reader — that would be absurd. It keeps the one book and a drawer of a million thin booklets. That is precisely the arrangement this paper argues for: a single frozen shared base plus a personal adapter per user, where the base is the encyclopedia and your adapter is the booklet that makes the answers yours.

LoRA is usually pitched as a cost-cutting trick: a cheaper way to fine-tune when you can't afford to touch all the weights, born from the same cost pressure that shapes serving many adapters. This paper flips the framing. The adapter is not a discount fine-tune you train and discard; it is persistent personal state (preferences, learned skills, and memory-like updates) that lives on top of the base indefinitely. The reframe matters because it changes what the adapter is for. A discount fine-tune is disposable; personal state is permanent, which means it needs an identity, a version history, and a place to live when it isn't being used.
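
One way to picture what permanent personal state demands is a record kept per adapter. The sketch below is hypothetical: this explainer names identity, revision, and provenance as concerns, but it does not give a schema, so every class, field, and method name here (`AdapterRecord`, `Revision`, `commit`, the `store://` locations) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    number: int        # 1-based position in the adapter's history
    weights_uri: str   # hypothetical location of this revision's weights
    data_note: str     # provenance: what data produced this revision


@dataclass
class AdapterRecord:
    adapter_id: str    # identity: which user's adapter this is
    base_version: str  # which frozen base the adapter targets
    revisions: List[Revision] = field(default_factory=list)

    def commit(self, weights_uri: str, data_note: str) -> Revision:
        """Append a new revision; earlier ones are kept so history stays auditable."""
        rev = Revision(len(self.revisions) + 1, weights_uri, data_note)
        self.revisions.append(rev)
        return rev

    def current(self) -> Optional[Revision]:
        """Latest revision, or None if nothing has been committed yet."""
        return self.revisions[-1] if self.revisions else None

    def get(self, number: int) -> Revision:
        """Look up a past revision by its 1-based number."""
        if not 1 <= number <= len(self.revisions):
            raise KeyError(f"no revision {number} for {self.adapter_id}")
        return self.revisions[number - 1]


record = AdapterRecord(adapter_id="user-42", base_version="base-v1")
record.commit("store://user-42/rev1", "initial preference data")
record.commit("store://user-42/rev2", "added tool-usage feedback")
```

The history is append-only by design in this sketch, so an older revision can always be inspected or served again without losing the newer one.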

The three scaling axes

Axis	What scales	The lever
Scale up	The shared base	A stronger ~1T base means each small adapter delta does more work, so less must be learned per user
Scale down	The adapter	How minimal an adapter can be while staying reliable — the booklet, not the encyclopedia (see rank vs. capacity)
Scale out	The instance count	~1,000,000 persistent adapters coexisting (the paper's stated target), managed by MinT infrastructure

The three axes are the paper's own organizing frame; the ~1T base and ~1,000,000-adapter figures are its stated targets, not measured deployments.

At a million coexisting adapters, the hard problems stop being about training and start being about bookkeeping. Which booklet is yours? Which base version was it clipped onto? When you revise it, what was the old version, and who is allowed to read it? And — the serving question — is your booklet on the reading desk right now, or is it filed in the back room? The paper bundles these under MinT: identity, revision, provenance, evaluation, and serving residency. Residency is the sharpest of these, because most of a million adapters are cold at any instant and must be paged into GPU memory on demand — exactly the question of where each adapter physically lives at serving time.
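
The residency question can be sketched as a small cache: a fixed number of adapter slots stand in for GPU memory, and everything else is fetched from cold storage on a miss. This is an assumption-heavy illustration, not the paper's design. Least-recently-used eviction is a policy chosen here for simplicity, and `ResidencyCache`, `load_cold`, and the byte strings standing in for weights are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class ResidencyCache:
    """Keeps at most `capacity` adapters hot; evicts the least recently used."""

    def __init__(self, capacity: int, load_cold: Callable[[str], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_cold = load_cold  # hypothetical fetch from slower storage
        self.hot: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, adapter_id: str) -> bytes:
        if adapter_id in self.hot:
            # Already resident: mark as most recently used.
            self.hot.move_to_end(adapter_id)
            self.hits += 1
            return self.hot[adapter_id]
        # Cold: page the adapter in, then evict the oldest if over capacity.
        self.misses += 1
        weights = self.load_cold(adapter_id)
        self.hot[adapter_id] = weights
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)
        return weights


# Toy cold store: three adapters, but only two hot slots.
cold_store = {"alice": b"A", "bob": b"B", "carol": b"C"}
cache = ResidencyCache(capacity=2, load_cold=cold_store.__getitem__)
for user in ["alice", "bob", "alice", "carol"]:
    cache.get(user)
resident = list(cache.hot)
```

In this run the request for carol pages out bob, because alice was touched more recently. A real system would also have to weigh adapter size, load latency, and batching, none of which this sketch models.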

The reason this is even thinkable is that an adapter is small in a way that a full copy never is:

Figure: a full 7B fine-tune (14 GB, or 14,000 MB) next to a LoRA adapter at r=16 (50 MB), both bars drawn at true 1:280 scale. The adapter is 280× smaller for the same behavior change.

Walk the storage budget with round numbers (illustrative). Suppose you wanted to personalize 1,000,000 users the old way, with a full fine-tune per user. A ~1T-parameter base at 2 bytes per parameter is about 2 TB per copy, so a million copies is ~2,000,000 TB ≈ 2 EB, the kind of number that ends the conversation. Now do it with adapters: a single personal adapter of, say, ~20 MB times a million users is ~20,000,000 MB ≈ 20 TB. Same fleet of personalized users, 2 EB → 20 TB, or roughly 100,000× less storage. Under this fixed-adapter assumption the gap widens as the base grows, since the encyclopedia gets bigger while the booklet stays about the same size. The "scale up" and "scale down" axes pull in the same direction, which is why a million personal models is an infrastructure problem rather than a fantasy. The same shrink-the-numbers instinct shows up elsewhere in serving, including the size problem that motivates quantization.
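
The round-number arithmetic above can be checked in a few lines. The sketch uses decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes) and the same illustrative inputs as the text; the variable names are assumptions of this example. It also adds one line the prose leaves implicit: the adapter approach still has to store the shared base once.

```python
TB = 10**12  # decimal terabyte, in bytes
EB = 10**18  # decimal exabyte, in bytes

BASE_PARAMS = 10**12           # ~1T-parameter base
BYTES_PER_PARAM = 2            # 16-bit weights
USERS = 10**6                  # one million personalized users
ADAPTER_BYTES = 20 * 10**6     # ~20 MB per personal adapter (illustrative)

base_copy_bytes = BASE_PARAMS * BYTES_PER_PARAM   # one full copy of the base
full_fleet_bytes = base_copy_bytes * USERS        # a full fine-tune per user
adapter_fleet_bytes = ADAPTER_BYTES * USERS       # an adapter per user

# Ratio quoted in the text: per-user copies versus per-user adapters.
savings = full_fleet_bytes // adapter_fleet_bytes

# The adapter fleet plus the single shared base that every adapter sits on.
adapter_total_bytes = adapter_fleet_bytes + base_copy_bytes

print(f"one base copy:      {base_copy_bytes / TB:.0f} TB")
print(f"full fine-tunes:    {full_fleet_bytes / EB:.0f} EB")
print(f"adapters only:      {adapter_fleet_bytes / TB:.0f} TB")
print(f"adapters + 1 base:  {adapter_total_bytes / TB:.0f} TB")
print(f"savings factor:     {savings:,}x")
```

Counting the one shared base, the adapter approach needs about 22 TB rather than 20 TB, which leaves the order-of-magnitude conclusion unchanged.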


Related explainers

PreFT — prefill-only LoRA adapters — where in the forward pass an adapter applies, the serving-cost flip side of where it lives.
Parametric Memory Law — LoRA capacity vs. verbatim recall — how much a small adapter can actually store, which sets the "scale down" floor.

