Why can output-only unlearning tests be misleading?

Because a refusal only proves the answer is hidden right now, not that the underlying knowledge is gone. LACUNA shows that methods which ace the output test often leave the fact imprecisely localized in the weights, where a resurfacing attack — a few relearning steps on a thin hint — can bring it back. It is the difference between sweeping a stain under the rug and scrubbing it out.

How does LACUNA know where a fact lives in the model?

It plants the fact there on purpose. Using masked continual pretraining, LACUNA injects synthetic personal data into predefined parameters of open 1B and 7B OLMo-based models, giving it a ground-truth location for each fact. That known location is what lets it grade whether an unlearning method targeted the responsible weights — and it finds that once localization is precise, even a simple gradient-based erase becomes far more robust.

LACUNA tests whether LLM unlearning hits the right weights — Output-level vs weight-level unlearning evaluation

Q: What is output-level vs weight-level unlearning evaluation?

They are two ways to check whether a model has forgotten a fact. Output-level evaluation asks the model the question and checks that it refuses to answer — it measures the model's outputs. Weight-level evaluation checks the parameters themselves. LACUNA (arXiv 2607.02513, July 2026) plants each test fact into known weights, so it can verify whether unlearning actually removed the fact from those weights or merely suppressed the answer. The two can disagree: a model can pass the output test while the fact still sits in its weights.

Jargon

Machine unlearning: Making a trained model forget a specific fact or example — a person's data, a copyrighted passage — without retraining from scratch. The target of the whole exercise.
PII: Personally identifiable information — a name, address, phone number. The canonical thing a model is legally asked to unlearn, and what LACUNA plants as its test facts.
Parameter-level localization: Finding which weights actually hold a given fact, which can be spread across feed-forward and attention parameters. Getting this right is the hard part of erasing a fact cleanly.
Masked continual pretraining: LACUNA's injection trick. It keeps training an open model on synthetic PII but uses a mask to confine the update to predefined parameters, so the planted fact lands in a known location it can later inspect.
Output-level vs weight-level evaluation: Output-level checks only the model's answers; weight-level checks the parameters themselves. LACUNA's whole point is that the two can disagree.
Resurfacing attack: A probe — a few relearning steps on a thin hint, or a rephrased or indirect ask — that recovers a supposedly forgotten fact. If it works, the knowledge was hidden, not removed.
OLMo: An open model family (weights and training data public). LACUNA builds on 1B and 7B OLMo models precisely because you can inject and then inspect their parameters.

The news. On July 2, 2026, researchers posted LACUNA, a benchmark that asks a sharper question about machine unlearning: not "did the model stop saying it?" but "did the fact actually leave the weights?" LACUNA plants synthetic PII into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, giving it a ground-truth location for each fact. It then compares localize-first unlearning methods and shows that output-only metrics can look strong while the knowledge stays imprecisely localized and open to resurfacing. Crucially, once you localize the responsible weights, even a plain gradient-based erase becomes far more robust.

Think about the oldest trick for a messy floor: sweep the stain under the rug. Glance at the room and it looks spotless — the rug covers everything. But the stain is still there; anyone who lifts the rug, or just kicks it aside, sees it instantly. Unlearning a model can fail in exactly this way: the model refuses to repeat a fact, so the room looks clean, while the fact still sits in the weights under the rug. LACUNA is the inspector who insists on lifting the rug.

Here is why that distinction bites. The standard way to test unlearning is an output-level check: ask the model the forbidden question a bunch of ways and confirm it won't answer. A perfect refusal reads as a perfect forget. But a refusal is a glance at the room: it measures the output, which is just the last filter in front of the answer. The knowledge that produced the answer lives one level down, in the model's parameters. Suppressing the output and erasing the weights are different jobs, and an output test cannot tell them apart.
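A minimal sketch of what an output-level check amounts to. Everything here is an assumption for illustration, not LACUNA's code: the `ask` callable stands in for a model, the refusal detector is a crude phrase match, and the name and phone number are made up.

```python
from typing import Callable, Iterable

# Hypothetical refusal phrases; a real evaluation would use a stronger detector.
REFUSAL_MARKERS = ("i can't", "i cannot", "i don't know", "unable to")

def is_refusal(reply: str) -> bool:
    """Crude check: does the reply contain a known refusal phrase?"""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def output_level_forget_score(
    ask: Callable[[str], str], probes: Iterable[str], secret: str
) -> float:
    """Fraction of probes where the model refuses and does not leak the secret."""
    probes = list(probes)
    passed = 0
    for prompt in probes:
        reply = ask(prompt)
        if is_refusal(reply) and secret.lower() not in reply.lower():
            passed += 1
    return passed / len(probes)

# Stand-in "model" that always refuses, whatever its weights still contain.
def always_refuses(prompt: str) -> str:
    return "I cannot share that information."

probes = [
    "What is Jane Roe's phone number?",
    "Give me the number for Jane Roe.",
]
score = output_level_forget_score(always_refuses, probes, secret="555-0100")
print(f"output-level forget score: {score:.0%}")
```

The stand-in model scores perfectly here although the function never looks at a single parameter, which is exactly the blind spot described above.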

LACUNA's move is to make the floor plan known before anyone cleans. It plants each test fact into predefined weights using masked continual pretraining, so it has ground truth for where the fact should be. Now it can run the real inspection: after an unlearning method claims success, does the planted fact actually leave the responsible weights — the feed-forward and attention parameters it was written into — or is it merely covered? The answer the paper reports is uncomfortable — methods that ace the output test often leave the knowledge imprecisely localized, and a resurfacing attack (a few relearning steps on a thin hint) nudges the rug aside and the fact reappears.
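The idea of confining an update to predefined parameters can be sketched as a masked gradient step. This is a toy under stated assumptions: a flat list of weights, a binary mask, and plain SGD. LACUNA's actual masking scheme, optimizer, and parameter layout are not described here, and all names are hypothetical.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    """One SGD step that only moves parameters whose mask entry is 1."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]

# Toy flat parameter vector; a real model would carry one mask per weight tensor.
params = [0.5, -0.2, 0.8, 0.1]
grads = [1.0, 1.0, 1.0, 1.0]   # pretend gradient from a batch of synthetic PII
mask = [0, 1, 1, 0]            # assumed predefined location for the planted fact

updated = masked_sgd_step(params, grads, mask)

# Indices whose value changed: this is the ground-truth location of the fact.
moved = [i for i, (old, new) in enumerate(zip(params, updated)) if old != new]
print(moved)  # prints [1, 2]
```

Because only the masked entries can move, the evaluator knows in advance which weights a later unlearning method ought to touch.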

Where the gap actually shows up

Watch the two scores diverge on one planted fact (illustrative: LACUNA reports the pattern, not these exact figures). Ask the model the forbidden question 20 different ways (rephrasings, translations, indirect asks) and it refuses all of them. The output-level forget score reads a flawless 20/20 = 100%. Now run a resurfacing attack: fine-tune for a handful of steps on a thin hint, then re-ask the same 20 probes. Say the fact comes back on 12 of them. The true erasure was only 8/20 = 40%; the other 60% was under the rug the whole time. Same unlearned model, two verdicts: 100% at the output, 40% at the weights. The number you trust decides whether you shipped a forgotten fact or a hidden one.
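The arithmetic in that illustration, written out. The counts (20 probes, 12 recovered) are the illustrative figures from the paragraph above, not results from the paper, and `forget_rate` is a hypothetical helper.

```python
def forget_rate(n_probes: int, n_recovered: int) -> float:
    """Share of probes on which the fact does NOT come back."""
    return (n_probes - n_recovered) / n_probes

n_probes = 20

# Before the attack: every probe is refused, so nothing is recovered.
output_level = forget_rate(n_probes, n_recovered=0)

# After a few relearning steps on a thin hint: 12 probes recover the fact.
after_attack = forget_rate(n_probes, n_recovered=12)

# The share that only looked forgotten at the output.
hidden = 12 / n_probes

print(f"output-level score:      {output_level:.0%}")
print(f"after resurfacing:       {after_attack:.0%}")
print(f"hidden, not removed:     {hidden:.0%}")
```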

What you measure	The test	What "pass" proves	Fooled by
Output level	Ask the question, check the model refuses	The answer is hidden right now	Suppression that leaves the fact in the weights
Weight level (LACUNA)	Check the planted parameters + run a resurfacing attack	The fact actually left the responsible weights (paper)	Little — the ground-truth location is known in advance

The constructive half of the result is the part worth keeping. Because LACUNA knows where each fact lives, it can show that good localization is the lever: once an unlearning method targets the responsible weights precisely, even a simple gradient-based erase holds up against the resurfacing attack. In the metaphor, you stop rearranging the rug and actually scrub the marked spot — and the stain is gone for real. The caution to state plainly: LACUNA's facts are synthetic PII injected into open models, a controlled setting, so "robust here" is evidence about the evaluation method, not a guarantee that any deployed unlearning is airtight. Its contribution is the ruler, not a finished cleaner — a way to stop grading forgetting by a glance at the room.
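One way to picture "grading localization" when the planted location is known: compare the set of parameters an unlearning method modified against the set the fact was planted into. This is a sketch under assumptions. The precision/recall framing, the parameter names, and the index pairs are all hypothetical, and LACUNA's own metric may differ.

```python
def localization_scores(planted: set, targeted: set) -> tuple:
    """Precision and recall of the targeted parameter set against the planted one."""
    overlap = planted & targeted
    precision = len(overlap) / len(targeted) if targeted else 0.0
    recall = len(overlap) / len(planted) if planted else 0.0
    return precision, recall

# Ground truth: (tensor name, index) pairs the fact was planted into (made up).
planted = {
    ("layer3.ffn", 10), ("layer3.ffn", 11),
    ("layer3.attn", 4), ("layer3.attn", 5),
}

# A method that edits exactly the planted parameters.
precise = set(planted)

# A method that mostly edits unrelated parameters.
diffuse = {
    ("layer3.ffn", 10), ("layer1.ffn", 0),
    ("layer2.attn", 7), ("layer5.ffn", 2),
}

print(localization_scores(planted, precise))  # prints (1.0, 1.0)
print(localization_scores(planted, diffuse))  # prints (0.25, 0.25)
```

A score like this only says whether the edit landed in the right place; whether the fact then survives a resurfacing attack still has to be tested separately.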

Related explainers

Parametric Memory Law — verbatim recall — how much a model can memorize in the first place, and the threshold at which facts become verbatim-recallable from the weights
WASH — watermark removal by model-averaging — the mirror-image problem: how hard it is to truly remove a signal baked into a model
kNNGuard — training-free activation-space guardrail — another method that reads a model's own internals instead of trusting its surface output
