What is the Parametric Memory Law, in one paragraph?

It is a power law, measured with LoRA as a probe, that links how much a finetune memorizes to two knobs: the effective parameters the adapter moves and the training sequence length. On a log-log plot it is a straight line, so memorization shows diminishing returns: each doubling of capacity multiplies recall by the same constant factor, which is less than 2 whenever the exponent is below 1. The paper pairs this with a token-level finding: verbatim recall of a token is guaranteed once its greedy probability passes 0.5, making recall a phase transition rather than a smooth curve.

Why does the p > 0.5 threshold matter?

Because it turns memorization from a vague spectrum into a countable yes/no per token. Once a token's probability exceeds 0.5, no other token can outrank it, so greedy decoding is certain to emit it; below 0.5 the token may still come out right, but only if no competitor happens to score higher. That sharp cliff means adding capacity flips tokens into recall one at a time, in order of difficulty, instead of nudging every token a little. It also tells you where training effort is wasted (on tokens already well above or hopelessly below the line) and where it pays off: on tokens hovering just under 0.5.

How does it relate to LoRA and Multi-LoRA serving?

LoRA is the measuring instrument: its single rank knob sets the effective parameters, so sweeping rank traces out the law. The result feeds straight into Multi-LoRA serving decisions — it quantifies how much an adapter of a given rank can actually memorize, which informs the rank-versus-quality tradeoff when you pack many adapters into one serving stack. MemFT, the paper's training tweak, then reallocates capacity toward near-threshold tokens so a fixed rank flips the most tokens into verbatim recall.

Parametric Memory Law links LoRA capacity to verbatim recall — The p > 0.5 recall threshold


TL;DR

What is it: The Parametric Memory Law paper uses LoRA as a probe to measure how much a finetune memorizes, and finds that verbatim recall of a token switches on the moment its greedy probability passes 0.5.
Why it’s needed: It turns the fuzzy question "how much can a small adapter memorize?" into a measurable power law: capacity (effective parameters × sequence length) predicts how many tokens lock in — which matters anywhere you finetune with LoRA or budget a Multi-LoRA serving stack.
vs previous: Earlier intuition treated memorization as a smooth curve that fills in gradually with more training; this paper shows it is a per-token phase transition — crossing 0.5 makes a token's recall a sure thing, while below it recall isn't guaranteed — so adding capacity flips tokens one at a time rather than nudging them all.

Jargon

LoRA (Low-Rank Adaptation): A finetuning method that freezes the base weights and trains a small low-rank add-on (two skinny matrices) per layer. Because the add-on's size is set by a single rank knob, it doubles as a clean dial for "how much capacity did we add?" Background: Quantization → Modern methods (QLoRA).
Effective parameters: Not the model's total parameter count, but the number of weights a given finetune actually moves. A LoRA of rank r on a d×d matrix has roughly 2·d·r effective parameters — far fewer than the d² it sits on top of.
Greedy decoding: Always emitting the single highest-probability next token. The paper measures recall under greedy decoding, so a token's greedy probability is just the model's confidence in the one word it would actually produce. See Text Generation → Greedy decoding.
Verbatim recall: Reproducing a training sequence token-for-token, not a paraphrase with the same meaning. This is the behaviour the law predicts and the thing privacy and memorization studies care about.
Power law (scaling law): A relationship where one quantity grows as a fixed power of another (y ∝ x^α). Plotted on log-log axes it's a straight line. Here, memorized content scales as a power of effective parameters and sequence length.
Phase transition: A sharp, threshold-crossing change in state — like water snapping to ice at 0 °C — rather than a smooth ramp. The paper's central claim is that verbatim recall is a phase transition at p = 0.5, not a gradual fade-in.
MemFT: The paper's proposed finetuning tweak: instead of spreading optimization evenly, it reallocates training toward tokens still sitting below 0.5 — spending capacity where it will actually flip a token into recall.
Parametric memory: Knowledge stored in the weights themselves (parameters), as opposed to retrieved from a context window or an external store at inference time.

The news. On May 29, 2026, the Parametric Memory Law paper used LoRA as a controlled probe to quantify how much memory a finetune installs in a model's weights. By sweeping the LoRA rank and the training sequence length, the authors fit a power law linking the two to how much content gets memorized, and they pin down a striking detail: a token is recalled verbatim once its greedy prediction probability passes 0.5. They package the finding into MemFT, a recipe that steers training toward the tokens still under that threshold.

Picture it this way. You're cramming a poem the night before a recital. Early on you have the gist of every line (you know roughly what each one says), but if someone stops you mid-verse you paraphrase. Then, after enough rehearsal, a line clicks: you can suddenly recite it word-for-word, and it stays locked. The click isn't gradual. One pass you're approximating, the next pass it's verbatim. The Parametric Memory Law paper says a finetuned model memorizes exactly like that: line by line, token by token, with a hard click rather than a slow fade.

The "how much you rehearse" axis is what LoRA makes measurable. A LoRA adapter freezes the base model and trains only a small low-rank add-on whose size is set by a single rank knob, so the authors can turn capacity up and down cleanly and watch what memorizes. That's the whole trick of using LoRA as a probe: it converts the vague idea of "training harder" into a number you can plot.

Sweep that knob and the first half of the law appears. Memorized content doesn't climb linearly with capacity; it follows a power law in the effective parameters the adapter moves and the sequence length it's trained on. On a log-log plot it's a straight line. Practically, that means diminishing returns are baked in: each doubling of rank moves you the same fixed step along that line, which multiplies memorized content by a constant factor (below 2 when the exponent is under 1) rather than doubling it.
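The doubling behaviour can be checked numerically. The sketch below assumes a generic power law y = scale · capacity^α; the exponent `ALPHA = 0.5` and the function name `memorized` are made up for illustration and are not values or names from the paper.

```python
# Hypothetical exponent, chosen only for illustration (not from the paper).
ALPHA = 0.5


def memorized(capacity: float, alpha: float = ALPHA, scale: float = 1.0) -> float:
    """Generic power law: memorized content = scale * capacity ** alpha."""
    return scale * capacity ** alpha


capacities = [16_384, 65_536, 262_144]

# Ratio of memorized content after doubling capacity: constant for a power law.
ratios = [memorized(2 * c) / memorized(c) for c in capacities]

# Extra memorized content per added unit of capacity: shrinks when alpha < 1.
per_param_gain = [(memorized(2 * c) - memorized(c)) / c for c in capacities]

for c, ratio, gain in zip(capacities, ratios, per_param_gain):
    print(f"capacity={c:>7,}  doubling factor={ratio:.4f}  gain per added unit={gain:.6f}")
```

The doubling factor is the same at every capacity (2^α), while the gain per added unit of capacity keeps falling, which is the diminishing-returns reading of a straight log-log line with slope below 1.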

The second half is the surprising part, and it's where the metaphor's "click" earns its keep. Look at any single token under greedy decoding and the paper reports a clean rule: once the model's probability for that exact token exceeds 0.5, recall is guaranteed, because no competing token can then score higher. That's a one-way claim: crossing 0.5 makes verbatim reproduction a sure thing, while below it recall is no longer guaranteed (the token may still come out right, but a synonym or a plausible neighbor can outrank it). Recall isn't a dimmer switch that brightens with training; it's a phase transition, and 0.5 is the critical point. As you add capacity, tokens don't all get a little more recallable together; they flip into recall one at a time, in order of how easy each was to memorize.
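The one-way nature of the rule follows from probabilities summing to 1, and a few lines of Python make it concrete. The distributions and the helper names `greedy_token` and `recall_guaranteed` below are invented for illustration; they are not the paper's data or code.

```python
def greedy_token(probs: dict[str, float]) -> str:
    """Greedy decoding: pick the single highest-probability token."""
    return max(probs, key=probs.get)


def recall_guaranteed(probs: dict[str, float], target: str) -> bool:
    """True when the target's probability exceeds 0.5, so nothing can outrank it."""
    return probs.get(target, 0.0) > 0.5


# Made-up next-token distributions; the training token is "river" in each case.
above = {"river": 0.56, "stream": 0.30, "brook": 0.14}
below_but_wins = {"river": 0.40, "stream": 0.35, "brook": 0.25}
below_and_loses = {"river": 0.40, "stream": 0.45, "brook": 0.15}

for name, dist in [("above", above),
                   ("below_but_wins", below_but_wins),
                   ("below_and_loses", below_and_loses)]:
    print(name, greedy_token(dist), recall_guaranteed(dist, "river"))
```

In the first case "river" must win because the remaining tokens share less than 0.5 between them. In the other two cases the same 0.40 probability either wins or loses depending on how the rest of the mass is split, which is why recall below the threshold is possible but not guaranteed.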

That reframing is what makes MemFT obvious in hindsight. If recall is decided per token at a 0.5 cliff, then training effort spent on tokens already at 0.92 is wasted, and so is effort on a token stuck at 0.05 that no realistic capacity will rescue. The tokens that matter are the ones hovering just under 0.5. MemFT reallocates the optimization budget toward exactly those near-threshold tokens, so a fixed amount of capacity flips the most tokens over the line. It's the difference between rehearsing the whole poem evenly and drilling the two lines you keep fumbling.
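To make the reallocation idea tangible, here is a minimal sketch of one way to weight tokens by how close they sit under the threshold. This is an assumption-heavy illustration, not the paper's MemFT procedure: the function `near_threshold_weights`, the Gaussian-shaped bump, and the `width` and `floor` parameters are all hypothetical.

```python
import math

THRESHOLD = 0.5


def near_threshold_weights(probs: list[float],
                           width: float = 0.15,
                           floor: float = 0.05) -> list[float]:
    """Hypothetical per-token training weights that peak just under 0.5.

    Tokens already past the threshold, and tokens far below it, fall back
    to a small floor weight. The result is normalized to sum to 1.
    """
    raw = []
    for p in probs:
        if p > THRESHOLD:
            raw.append(floor)  # already recalled under greedy decoding
        else:
            gap = THRESHOLD - p
            raw.append(max(floor, math.exp(-(gap / width) ** 2)))
    total = sum(raw)
    return [w / total for w in raw]


# Made-up greedy probabilities for four training tokens.
probs = [0.92, 0.47, 0.05, 0.41]
weights = near_threshold_weights(probs)

for p, w in zip(probs, weights):
    print(f"p={p:.2f}  weight={w:.3f}")
```

With these made-up numbers the tokens at 0.47 and 0.41 receive most of the weight, while the token at 0.92 and the one stuck at 0.05 share the same small floor. A weighted training loss could multiply each token's loss by its weight, though how the paper actually implements the reallocation is not specified here.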

How recall jumps with capacity

A capacity sweep on a single nine-token line, using illustrative greedy probabilities, shows the staircase the law predicts — recall climbs in discrete jumps, not a smooth ramp, and one token never makes it:

LoRA rank r	Effective params (2·d·r, d = 4096)	Tokens past p > 0.5 (illustrative)
r = 2	~16K	0 / 9
r = 8	~66K	3 / 9
r = 32	~262K	6 / 9
r = 128	~1.0M	8 / 9

Walk the numbers once. Take a single attention projection of width d = 4096. A full finetune would move all d² ≈ 16.8M weights; a LoRA of rank r touches only 2·d·r effective parameters. At r = 8 that's 2 × 4096 × 8 = 65,536 ≈ 65.5K, a knob 256× smaller than the full matrix. The power law says memorized content grows as a fixed power of the effective-parameter count and the sequence length, so doubling the rank to r = 16 (≈ 131K effective params) doesn't double recall. It advances you one fixed step along the curve, and that step shows up as whichever tokens were sitting just under p = 0.5 flipping over. Spend that same step with MemFT aimed at the near-threshold tokens and you flip more of them per unit capacity.
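The arithmetic above is easy to reproduce. The sketch below uses the 2·d·r count from the text for a single square d×d projection; the helper name `lora_effective_params` is made up for this example.

```python
def lora_effective_params(d: int, r: int) -> int:
    """Trainable weights in a rank-r LoRA on a d x d matrix: two d x r factors."""
    return 2 * d * r


D = 4096
full = D * D  # weights a full finetune of the same matrix would move

for r in (2, 8, 16, 32, 128):
    eff = lora_effective_params(D, r)
    print(f"r={r:>3}  effective={eff:>9,}  full/effective={full // eff}x")
```

At r = 8 this gives 65,536 effective parameters against 16,777,216 in the full matrix, the 256× gap quoted above, and each doubling of rank halves that gap.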

Goes deeper in: LLM Serving → Multi-LoRA → LoRA math at serving time

Related explainers

PreFT — prefill-only LoRA adapters — another study that treats where and how much a LoRA adapter acts as the variable of interest
MSSP vs muP — MoE scaling — a different scaling-law axis (hyperparameter transfer), complementary to this paper's capacity-vs-recall law
RELEX — rank-1 RLVR weight-trajectory extrapolation — what a low-rank weight update does to behaviour, viewed from the RL side rather than the memorization side
