The news. On June 9, 2026, LARK Lab released Attention Amnesia, a study of what happens to hybrid linear-attention LLMs when you fine-tune them on chain-of-thought data. The finding is blunt: fine-tuning biases attention gradients toward short-range patterns, and long-range retrieval falls off a cliff. On a 9B hybrid model, NIAH-S2@256K recall collapses from 67.2% to 9.4% after CoT fine-tuning. Their training-free QK-Restore recovers a 5B model's NIAH@256K recall from 65.4% to 76.4%, above where careful fine-tuning left it, while keeping the reasoning gains.

Picture a librarian who has spent years learning the whole building, deep back stacks included. Ask for an obscure title and she walks straight to the right shelf two hundred meters back. Then she crams for weeks on a rapid-fire front-desk quiz — common questions, fast answers. She aces the quiz. But now, asked for that obscure title, she reflexively checks the front desk, shrugs, and gives up. The book is still there; the route to it just faded. That fade is attention amnesia, and the quiz is chain-of-thought fine-tuning: the very training that made her sharper at quick reasoning is what dimmed her reach into the back of the library.

Inside the model, "the route to the back stacks" is concrete. A token decides what to pay attention to by comparing its query against every earlier token's key — and those queries and keys are produced by two weight matrices, W_Q and W_K. Fine-tuning nudges every weight to lower the training loss, and the paper's key observation is that on chain-of-thought data the cheapest way to lower that loss is to lean on nearby tokens. So the gradients quietly reshape W_Q and W_K to favor short range, and the full-attention layers that used to span a quarter-million tokens increasingly prefer what is close. The needle, sitting at depth 250,000, lands in a blind spot.

[Diagram: the token's embedding vector is multiplied by W_Q to give the query (Q), by W_K to give the key (K), and by W_V to give the value (V).]
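To make the query-key comparison concrete, here is a minimal single-head sketch in plain Python. It is an illustration rather than the paper's code: the toy 2-dimensional embeddings, the identity projection matrices, and the helper names `matvec` and `attention_weights` are all assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention_weights(embeddings, W_Q, W_K):
    """Attention weights of the last token over itself and every earlier token."""
    query = matvec(W_Q, embeddings[-1])          # the current token's query
    keys = [matvec(W_K, e) for e in embeddings]  # one key per position
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns the raw scores into a probability budget that sums to 1.0.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings and projections, made up for illustration only.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(embeddings, identity, identity)
print([round(w, 3) for w in weights])
```

Changing `W_Q` or `W_K` changes the scores and therefore where the budget goes, which is why a shift in those two matrices is enough to move a head's attention from far positions to near ones.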

Here is the mechanism as arithmetic (illustrative). An attention head spreads a probability budget of 1.0 across all earlier positions. With long reach, a needle at position 250,000 might receive weight 0.04 — small, but enough to surface into the answer. After fine-tuning biases the head short, that same head dumps roughly 0.97 of its budget onto the last 2,000 tokens, leaving ~0.03 to spread across the other 254,000 positions. The needle's slice falls toward zero, and a fact that was retrievable becomes invisible. That is why the measured recall doesn't degrade gracefully — it collapses, 67.2% down to 9.4% on the 9B model the paper reports.
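The same illustrative arithmetic can be written out in a few lines of Python. This sketch assumes "256K" means 256,000 positions and that the leftover attention is split evenly across the distant positions; both are simplifications for the example, not figures from the paper.

```python
# Illustrative numbers from the paragraph above, not measurements.
context_len = 256_000    # assumption: "256K" read as 256,000 positions
recent_window = 2_000    # tokens the short-biased head concentrates on
weight_before = 0.04     # needle's attention weight with long reach
recent_mass = 0.97       # share of the budget spent on the recent window

remaining_mass = 1.0 - recent_mass
distant_positions = context_len - recent_window

# Assumption: the leftover mass is spread evenly over all distant positions.
weight_after = remaining_mass / distant_positions
shrink_factor = weight_before / weight_after

print(f"distant positions: {distant_positions}")
print(f"needle weight after: {weight_after:.2e}")
print(f"shrink factor: {shrink_factor:,.0f}x")
```

Under these assumptions the needle's share drops by more than five orders of magnitude, which is the sense in which recall collapses rather than degrading gradually.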

The fix follows straight from the diagnosis. If the recall loss rides on the query and key projections, roll just those back. QK-Restore copies W_Q and W_K from the pre-fine-tuning checkpoint into the fine-tuned model and leaves everything else — the layers that actually learned to reason — untouched. No data, no gradient steps, no second training run: it is a weight swap, the equivalent of handing the librarian her old card catalog while she keeps every quiz answer she memorized. A Procrustes variant goes gentler, rotating the old Q/K to best align with the new ones so recall and reasoning don't tug against each other.
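The weight swap described above can be sketched as a dictionary operation over two checkpoints. This is a minimal illustration, not the paper's implementation: the parameter names containing `q_proj` and `k_proj`, the toy weight values, and the function name `qk_restore` are hypothetical, and the Procrustes variant is not shown.

```python
def qk_restore(base_state, tuned_state, qk_markers=("q_proj", "k_proj")):
    """Return a copy of tuned_state whose query/key projections come from base_state.

    Assumption: query and key weights can be identified by substrings in their
    parameter names. Real checkpoints may use different naming.
    """
    restored = dict(tuned_state)  # shallow copy; inputs are left unmodified
    for name, base_weight in base_state.items():
        if name in restored and any(marker in name for marker in qk_markers):
            restored[name] = base_weight  # roll this projection back
    return restored

# Toy "checkpoints": parameter name -> weights (hypothetical names and values).
base = {
    "layer0.attn.q_proj": [[1.0, 0.0], [0.0, 1.0]],
    "layer0.attn.k_proj": [[1.0, 0.0], [0.0, 1.0]],
    "layer0.attn.v_proj": [[0.5, 0.5], [0.5, 0.5]],
    "layer0.mlp.up": [[0.1, 0.2], [0.3, 0.4]],
}
tuned = {
    "layer0.attn.q_proj": [[0.9, 0.1], [0.2, 0.8]],
    "layer0.attn.k_proj": [[0.7, 0.3], [0.1, 0.9]],
    "layer0.attn.v_proj": [[0.6, 0.4], [0.4, 0.6]],
    "layer0.mlp.up": [[0.2, 0.1], [0.4, 0.3]],
}

restored = qk_restore(base, tuned)
rolled_back = sorted(name for name in restored if restored[name] is base[name])
print(rolled_back)
```

With real tensors you would likely copy or clone each restored weight instead of sharing the reference, so that later edits to one checkpoint cannot leak into the other.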

| State of the model | Long-range recall (NIAH@256K) | Reasoning gains | Weights changed |
| --- | --- | --- | --- |
| Before CoT fine-tuning | strong: ~67.2% (9B, reported) | baseline | |
| After CoT fine-tuning | collapses: ~9.4% (9B, reported) | gained | all weights, incl. W_Q/W_K |
| After QK-Restore | restored: 65.4% → 76.4% (5B, reported) | kept | only W_Q/W_K rolled back |

Why does such a blunt move work — and even beat the original? Because the recall loss could be reverted on its own. Fine-tuning improved the reasoning machinery while shifting the addressing machinery — the Q/K projections — and rolling that one part back leaves the reasoning gains in place. On the 5B model the paper reports, recall climbs from the fine-tuned 65.4% to 76.4%, past where the careful fine-tune left it, because the restored Q/K reach is now paired with reasoning layers that the base model never had. The lesson generalizes past this one paper: when fine-tuning silently costs a capability, the cure may not be more training but undoing the few weights that moved too far — a reminder that where a token looks is a separate, restorable part of the network from what it knows.


Frequently Asked Questions