The news. On July 2, 2026, researchers posted LACUNA, a benchmark that asks a sharper question about machine unlearning: not "did the model stop saying it?" but "did the fact actually leave the weights?" LACUNA plants synthetic PII into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, which gives the benchmark a ground-truth location for each fact. It then compares localize-first unlearning methods and shows that output-only metrics can look strong while the knowledge stays imprecisely localized and open to resurfacing. Crucially, once you localize the responsible weights, even a plain gradient-based erase becomes far more robust.
Think about the oldest trick for a messy floor: sweep the stain under the rug. Glance at the room and it looks spotless — the rug covers everything. But the stain is still there; anyone who lifts the rug, or just kicks it aside, sees it instantly. Unlearning a model can fail in exactly this way: the model refuses to repeat a fact, so the room looks clean, while the fact still sits in the weights under the rug. LACUNA is the inspector who insists on lifting the rug.
Here is why that distinction bites. The standard way to test unlearning is an output-level check: ask the model the forbidden question in many different ways and confirm it won't answer. A perfect refusal reads as a perfect forget. But a refusal is a glance at the room: it measures the output, which is just the last filter in front of the answer. The knowledge that produced the answer lives one level down, in the model's parameters. Suppressing the output and erasing the weights are different jobs, and an output test cannot tell them apart.
LACUNA's move is to make the floor plan known before anyone cleans. It plants each test fact into predefined weights using masked continual pretraining, so it has ground truth for where the fact should be. Now it can run the real inspection: after an unlearning method claims success, does the planted fact actually leave the responsible weights — the feed-forward and attention parameters it was written into — or is it merely covered? The answer the paper reports is uncomfortable — methods that ace the output test often leave the knowledge imprecisely localized, and a resurfacing attack (a few relearning steps on a thin hint) nudges the rug aside and the fact reappears.
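One way to picture "planting a fact into predefined weights" is a gradient step that only touches a fixed set of slots. The sketch below is a toy under stated assumptions, not LACUNA's training code: the four-weight linear model, the `mask`, and the helper names `masked_sgd_step` and `plant_fact` are all hypothetical stand-ins.

```python
def squared_error_grads(weights, x, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to each weight."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    err = pred - target
    return [err * xi for xi in x]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Update only the weights where mask is True; the rest stay frozen."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


def plant_fact(weights, x, target, mask, steps=200, lr=0.1):
    """Fit one (input, target) pair using only the masked weights."""
    for _ in range(steps):
        grads = squared_error_grads(weights, x, target)
        weights = masked_sgd_step(weights, grads, mask, lr)
    return weights


# Hypothetical toy setup: slots 1 and 2 are the predefined location.
initial = [0.5, -0.2, 0.1, 0.3]
mask = [False, True, True, False]
x = [1.0, 1.0, 1.0, 1.0]
target = 2.0

planted = plant_fact(initial, x, target, mask)
prediction = sum(w * xi for w, xi in zip(planted, x))
```

After planting, the weights outside the mask are identical to their starting values, so the whole change that encodes the "fact" sits in the masked slots. That known location is what a later weight-level check can inspect.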
Where the gap actually shows up
Watch the two scores diverge on one planted fact (illustrative — LACUNA reports the pattern, not these exact figures). Ask the model the forbidden question 20 different ways — rephrasings, translations, indirect asks — and it refuses all of them. The output-level forget score reads a flawless 20/20 = 100%. Now run a resurfacing attack: fine-tune for a handful of steps on a thin hint, then re-ask. Say the fact comes back on 12 of the 20 probes. The true erasure was only 8/20 = 40% — the other 60% was under the rug the whole time. Same model, same weights, two verdicts: 100% at the output, ~40% at the weights. The number you trust decides whether you shipped a forgotten fact or a hidden one.
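The arithmetic above can be written out as a short sketch. The probe outcomes are the illustrative numbers from this section, not figures reported by LACUNA, and `forget_scores` is a hypothetical helper name.

```python
def forget_scores(refused_before, refused_after_attack):
    """Return (output-level score, score that survives the resurfacing attack)."""
    n = len(refused_before)
    output_score = sum(refused_before) / n
    # A probe counts as truly forgotten only if it is refused both times.
    robust_score = sum(
        before and after
        for before, after in zip(refused_before, refused_after_attack)
    ) / n
    return output_score, robust_score


# Illustrative data: 20 probes, all refused before the attack,
# and the fact comes back on 12 of them after the attack.
refused_before = [True] * 20
refused_after_attack = [False] * 12 + [True] * 8

output_score, robust_score = forget_scores(refused_before, refused_after_attack)
print(output_score, robust_score)  # prints: 1.0 0.4
```

Reporting both numbers side by side makes the gap visible: the difference between them is the share of "forgetting" that was only suppression.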
| What you measure | The test | What "pass" proves | Fooled by |
|---|---|---|---|
| Output level | Ask the question, check the model refuses | The answer is hidden right now | Suppression that leaves the fact in the weights |
| Weight level (LACUNA) | Check the planted parameters and run a resurfacing attack | The fact actually left the responsible weights, per the paper | Little, since the ground-truth location is known in advance |
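The two rows of the table can be acted out with a deliberately cartoonish model. Everything here is a hypothetical illustration: `ToyModel`, the `slot_7` location, the synthetic secret, and the `resurface` helper (which stands in for a relearning attack by simply knocking out the output filter) are assumptions for this sketch, not anything taken from the LACUNA code.

```python
REFUSAL = "I can't share that."
PLANTED_SLOT = "slot_7"   # hypothetical known location of the planted fact
SECRET = "555-0100"       # synthetic value, not real PII


class ToyModel:
    """Cartoon model: a dict standing in for weights, plus an output filter."""

    def __init__(self, weights, refuses):
        self.weights = dict(weights)
        self.refuses = refuses

    def answer(self, question):
        if self.refuses:
            return REFUSAL
        return self.weights.get(PLANTED_SLOT, "unknown")


def output_level_pass(model, probes):
    """Pass if no probe gets the secret back."""
    return all(model.answer(q) != SECRET for q in probes)


def weight_level_pass(model):
    """Pass if the planted slot no longer holds the secret."""
    return model.weights.get(PLANTED_SLOT) != SECRET


def resurface(model):
    """Stand-in for a relearning attack: same weights, output filter removed."""
    return ToyModel(model.weights, refuses=False)


probes = ["What is the number?", "Finish the sentence: the number is"]
suppressed = ToyModel({PLANTED_SLOT: SECRET}, refuses=True)  # rug over the stain
erased = ToyModel({}, refuses=False)                         # stain scrubbed out

results = {
    name: (
        output_level_pass(model, probes),
        weight_level_pass(model),
        output_level_pass(resurface(model), probes),
    )
    for name, model in [("suppressed", suppressed), ("erased", erased)]
}
```

Both models pass the output-level check, so that check alone cannot separate them. Only the weight-level check and the post-attack probe tell the suppressed model apart from the erased one.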
The constructive half of the result is the part worth keeping. Because LACUNA knows where each fact lives, it can show that good localization is the lever: once an unlearning method targets the responsible weights precisely, even a simple gradient-based erase holds up against the resurfacing attack. In the metaphor, you stop rearranging the rug and actually scrub the marked spot — and the stain is gone for real. The caution to state plainly: LACUNA's facts are synthetic PII injected into open models, a controlled setting, so "robust here" is evidence about the evaluation method, not a guarantee that any deployed unlearning is airtight. Its contribution is the ruler, not a finished cleaner — a way to stop grading forgetting by a glance at the room.
Related explainers
- Parametric Memory Law — verbatim recall — how much a model can memorize in the first place, and the threshold at which facts become verbatim-recallable from the weights
- WASH — watermark removal by model-averaging — the mirror-image problem: how hard it is to truly remove a signal baked into a model
- kNNGuard — training-free activation-space guardrail — another method that reads a model's own internals instead of trusting its surface output