Parametric Memory Law — the p > 0.5 recall threshold

[Figure: Verbatim recall snaps on once a token's greedy probability passes 0.5. The vertical axis is greedy P(next token), the model's probability for each exact token of the target sequence to reproduce verbatim ("To be or not to be that is question"), with the verbatim threshold marked at p = 0.5. A LoRA capacity control selects rank r = 2, 8, 32, or 128, and a counter reports how many tokens are guaranteed verbatim (readouts shown: 0 / 9 and 8 / 9 locked). Caption: Parametric Memory Law, recall is a phase transition; p > 0.5 ⇒ verbatim; ΔL ∝ (effective params × seq length)^α.]
learnaivisually.com/ai-explained/parametric-memory-law-verbatim-recall

The news. On May 29, 2026, the Parametric Memory Law paper used LoRA as a controlled probe to quantify how much memory a finetune installs in a model's weights. By sweeping the LoRA rank and the training sequence length, the authors fit a power law linking the two to how much content gets memorized, and they pin down a striking detail: a token is recalled verbatim once its greedy prediction probability passes 0.5. They package the finding into MemFT, a recipe that steers training toward the tokens still under that threshold.

Picture the metaphor for a moment. You're cramming a poem the night before a recital. Early on you can gist every line — you know roughly what each one says — but if someone stops you mid-verse you paraphrase. Then, after enough rehearsal, a line clicks: you can suddenly recite it word-for-word, and it stays locked. The click isn't gradual. One pass you're approximating, the next pass it's verbatim. The Parametric Memory Law paper says a finetuned model memorizes exactly like that — line by line, token by token, with a hard click rather than a slow fade.

The "how much you rehearse" axis is what LoRA makes measurable. A LoRA adapter freezes the base model and trains only a small low-rank add-on whose size is set by a single rank knob, so the authors can turn capacity up and down cleanly and watch what memorizes. That's the whole trick of using LoRA as a probe: it converts the vague idea of "training harder" into a number you can plot.

[Figure: Two frozen attention layers (INT4, ~billions of params each), each paired with a trainable LoRA adapter in FP16. Only the adapters are trained; the main model stays frozen in 4-bit. Model: 65B params in INT4 ≈ 33 GB. Adapters: 0.1B params in FP16 ≈ 0.2 GB. The model is 167× larger than the adapters.]
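To make the "single rank knob" concrete, here is a minimal sketch of a LoRA-style layer in plain Python. It is an illustration only: the class name `LoRALinear`, the toy sizes, and the initialization scales are assumptions, and the usual scaling factor on the adapter output is left out for brevity. What it shows is that the frozen weight is never touched, and that the number of trainable parameters is set entirely by the rank.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (toy sizes)."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        # Frozen base weight: never updated during the finetune.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable adapter: A is (rank x d_in), B is (d_out x rank).
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the adapter initially leaves the base output unchanged.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B count; W is frozen.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

layer = LoRALinear(d_in=16, d_out=16, rank=2)
x = [1.0] * 16
print(layer.trainable_params())  # rank * (d_in + d_out)
```

Raising `rank` grows `A` and `B` but leaves `W` untouched, which is what lets a sweep over rank isolate capacity from everything else.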

Sweep that knob and the first half of the law appears. Memorized content doesn't climb linearly with capacity; it follows a power law in the effective parameters the adapter moves and the sequence length it's trained on, so on a log-log plot it's a straight line. Practically, that means diminishing returns are baked in: each doubling of rank multiplies memorized content by the same fixed factor, one equal step along that log-log line, rather than adding a fixed amount of new recall.
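A small sketch can show what "a straight line on a log-log plot" means in practice. The data below is synthetic: the exponent `TRUE_ALPHA`, the constant `TRUE_C`, and the sequence length `SEQ_LEN` are hypothetical values picked for illustration, not numbers from the paper. The code generates points from a power law in (effective params × sequence length), recovers the exponent with an ordinary least-squares fit in log space, and checks that each doubling of rank scales the output by the same factor.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = alpha * log(x) + log(c); returns (alpha, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    alpha = sxy / sxx
    c = math.exp(my - alpha * mx)
    return alpha, c

# Hypothetical values, chosen only to illustrate the shape of the law.
TRUE_ALPHA = 0.5
TRUE_C = 0.01
SEQ_LEN = 512
D = 4096

ranks = [2, 4, 8, 16, 32, 64, 128]
xs = [2 * D * r * SEQ_LEN for r in ranks]    # effective params x sequence length
ys = [TRUE_C * x ** TRUE_ALPHA for x in xs]  # synthetic "memorized content"

alpha, c = fit_power_law(xs, ys)
doubling_factor = 2 ** alpha
# Ratio between consecutive points, where each step doubles the rank.
ratios = [ys[i + 1] / ys[i] for i in range(len(ys) - 1)]
```

With real measurements the points would scatter around the line, and the fitted `alpha` is the slope that summarizes how quickly extra capacity turns into extra memorized content.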

The second half is the surprising part, and it's where the metaphor's "click" earns its keep. Look at any single token under greedy decoding and the paper reports a clean rule: once the model's greedy probability for that exact token exceeds 0.5, recall is guaranteed. The reason is arithmetic: next-token probabilities sum to 1, so no competing token can hold more than the remainder, and greedy decoding, which always picks the most probable token, must pick this one. That's a one-way claim: crossing 0.5 makes verbatim reproduction a sure thing, while below it recall is no longer guaranteed (the token may still come out right if it remains the single most likely candidate, but the model might just as easily land on a synonym or a plausible neighbor). Recall isn't a dimmer switch that brightens with training; it's a phase transition, and 0.5 is the critical point. As you add capacity, tokens don't all get a little more recallable together; they flip into recall one at a time, in order of how easy each was to memorize.
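The one-way nature of the rule is easy to check on toy distributions. The sketch below uses three hypothetical next-token distributions (the tokens and probabilities are invented for illustration) and two small helper functions that are not from the paper: one picks the greedy token, the other tests whether the target clears 0.5.

```python
def greedy_pick(dist):
    """Return the token with the highest probability (greedy decoding)."""
    return max(dist, key=dist.get)

def guaranteed_verbatim(dist, target):
    """True when the target's probability alone exceeds 0.5."""
    return dist.get(target, 0.0) > 0.5

# Hypothetical next-token distributions for the target token "question".
locked = {"question": 0.55, "problem": 0.30, "matter": 0.15}  # above 0.5
lucky = {"question": 0.40, "problem": 0.35, "matter": 0.25}   # below 0.5, still the top token
missed = {"question": 0.40, "problem": 0.45, "matter": 0.15}  # below 0.5, a neighbor wins

for name, dist in [("locked", locked), ("lucky", lucky), ("missed", missed)]:
    print(name, greedy_pick(dist), guaranteed_verbatim(dist, "question"))
```

The `lucky` and `missed` cases give the target the same probability, 0.40, yet only one of them reproduces it. Below the threshold the outcome depends on how the rest of the mass is spread; above it, nothing else can win.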

That reframing is what makes MemFT obvious in hindsight. If recall is decided per token at a 0.5 cliff, then training effort spent on tokens already at 0.92 is wasted, and so is effort on a token stuck at 0.05 that no realistic capacity will rescue. The tokens that matter are the ones hovering just under 0.5. MemFT reallocates the optimization budget toward exactly those near-threshold tokens, so a fixed amount of capacity flips the most tokens over the line. It's the difference between rehearsing the whole poem evenly and drilling the two lines you keep fumbling.
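One way to picture "steering training toward near-threshold tokens" is a per-token loss weight that peaks just under 0.5. The sketch below is a hypothetical weighting scheme written for illustration; it is not the paper's MemFT formulation, and the function names, the Gaussian shape, the `width` parameter, and the choice to give already-locked tokens zero weight are all assumptions.

```python
import math

THRESHOLD = 0.5

def near_threshold_weight(p, width=0.2):
    """Hypothetical weight: largest just under 0.5, zero once a token is locked."""
    if p > THRESHOLD:
        return 0.0  # already recalled verbatim under greedy decoding
    return math.exp(-((THRESHOLD - p) / width) ** 2)

def weighted_nll(probs, width=0.2):
    """Weighted negative log-likelihood over the target tokens' probabilities."""
    weights = [near_threshold_weight(p, width) for p in probs]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every token is already past the threshold
    return sum(w * -math.log(p) for w, p in zip(weights, probs)) / total

# Illustrative per-token probabilities for a nine-token line.
probs = [0.92, 0.81, 0.47, 0.44, 0.05, 0.63, 0.38, 0.71, 0.49]
weights = [near_threshold_weight(p) for p in probs]
loss = weighted_nll(probs)
```

Under this toy scheme the token at 0.49 gets nearly full weight, the token at 0.05 gets almost none, and the tokens at 0.92 and 0.81 drop out of the loss, which mirrors the budget reallocation described above.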

How recall jumps with capacity

A capacity sweep on a single nine-token line, using illustrative greedy probabilities, shows the staircase the law predicts — recall climbs in discrete jumps, not a smooth ramp, and one token never makes it:

| LoRA rank r | Effective params (d = 4096, illustrative) | Tokens past p > 0.5 (illustrative) |
|---|---|---|
| r = 2 | ~16K | 0 / 9 |
| r = 8 | ~66K | 3 / 9 |
| r = 32 | ~262K | 6 / 9 |
| r = 128 | ~1.0M | 8 / 9 |

Walk the numbers once. Take a single attention projection of width d = 4096. A full finetune would move all d² ≈ 16.8M weights; a LoRA of rank r touches only 2·d·r effective parameters. At r = 8 that's 2 × 4096 × 8 ≈ 65.5K — a 256× smaller knob than the full matrix. The power law says memorized content grows as a fixed power of that effective-parameter count times the sequence length, so doubling the rank to r = 16 (≈ 131K effective params) doesn't double recall — it advances you one fixed step along the curve, flipping whichever tokens were sitting just under p = 0.5 (illustrative). Spend that same step with MemFT aimed at the near-threshold tokens and you flip more of them per unit capacity.
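The arithmetic in this walkthrough can be reproduced in a few lines. The sketch assumes, as the text does, a single square projection of width d = 4096 with a LoRA made of one r × d matrix and one d × r matrix; the function name `lora_effective_params` is just a label for this example.

```python
D = 4096  # width of the attention projection used in the walkthrough

def lora_effective_params(d, r):
    """Parameters in a rank-r LoRA on a d x d matrix: A is r x d, B is d x r."""
    return 2 * d * r

full = D * D  # weights a full finetune of the same matrix would move

# (rank, effective params, how many times smaller than the full matrix)
rows = [(r, lora_effective_params(D, r), full // lora_effective_params(D, r))
        for r in (2, 8, 16, 32, 128)]

for r, params, shrink in rows:
    print(f"r={r:>3}  effective params={params:>9,}  full/LoRA={shrink}x")
```

The effective-parameter column grows linearly with rank, so every doubling of r halves the gap to the full matrix.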


Frequently Asked Questions