What is QK-Restore?

QK-Restore is a training-free fix for attention amnesia, the loss of long-range recall that chain-of-thought fine-tuning causes in hybrid linear-attention LLMs. It copies the query and key projection matrices (W_Q and W_K) from the model's pre-fine-tuning checkpoint back into the fine-tuned model, leaving every other weight untouched. Because it is just a weight swap, it needs no data and no gradient steps, and it keeps the reasoning gains fine-tuning added.

Why does chain-of-thought fine-tuning hurt long-range recall?

Fine-tuning adjusts every weight to lower the training loss, and on chain-of-thought data the easiest way to do that is to lean on nearby tokens. That biases the query and key projections toward short-range attention, so the full-attention layers that used to span a 256K context increasingly favor nearby tokens over distant ones. A fact buried deep in the context falls into a blind spot — on a 9B hybrid model, NIAH-S2@256K recall collapsed from 67.2% to 9.4%.

How does this relate to RoPE's long-context limits?

They are different failure modes. RoPE's long-context limits are a hard ceiling baked into positional encoding — the model literally cannot tell some far-apart positions apart. Attention amnesia is an acquired, fixable injury: fine-tuning moved the Q/K projections, and rolling them back with QK-Restore recovers the recall. One is a property of the architecture; the other is a side effect of training.

Attention Amnesia: CoT fine-tuning wrecks long-range recall in hybrid LLMs — Training-free QK-Restore

TL;DR

What is it: The Attention Amnesia paper shows that chain-of-thought fine-tuning silently destroys long-range retrieval in hybrid linear-attention LLMs, and introduces QK-Restore — a training-free fix that rolls the query/key projections back to the pre-fine-tuning checkpoint.
Why it’s needed: Teams fine-tune base models to make them reason better, and this paper finds that the same step can quietly knock out the model's ability to recall a fact buried deep in a long context — a hidden tax you only notice on a needle-in-a-haystack test. QK-Restore buys that recall back for free.
Vs. previous: The obvious responses (accept the loss, or re-fine-tune with long-context data) either give up the capability or cost another training run. QK-Restore changes only the Q and K projection matrices, keeps every other fine-tuned weight so the reasoning gains stay, and needs zero gradient steps.

Jargon

Hybrid linear-attention LLM: A model that mixes two attention styles: cheap linear-attention layers that compress the past into a fixed-size state, plus a few full softmax-attention layers that can look back precisely. Those full layers are what let it look back across a long context — and what this paper finds fine-tuning damages.
Chain-of-thought (CoT) fine-tuning: Training a base model on worked, step-by-step reasoning traces so it learns to "think out loud." It reliably improves reasoning — but the paper shows it biases attention toward nearby tokens as a side effect.
NIAH (needle-in-a-haystack): A long-context stress test: hide one fact (the needle) at a known depth inside a huge context (the haystack) and ask the model to retrieve it. S2/S3 and @256K name the difficulty and the 256,000-token context length.
Q/K projection matrices: The weight matrices W_Q and W_K that turn each token's embedding into a query and a key. Their dot product decides where a token looks; shift them and you shift the model's whole attention reach.
QK-Restore: The fix. It copies the W_Q and W_K matrices from the pre-fine-tuning checkpoint back into the fine-tuned model, leaving every other weight as fine-tuning left it. No gradient steps — it's a weight swap.
Procrustes alignment: A variant that, instead of copying the old Q/K verbatim, rotates them to best line up with the fine-tuned ones, trading off long-range recall against reasoning quality so the two don't fight.
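
For readers who want the Procrustes idea in symbols: this article does not spell out the paper's exact formulation, so what follows is only the textbook orthogonal Procrustes problem. The symbols $W^{\text{base}}$ and $W^{\text{ft}}$ (a base and a fine-tuned projection matrix) are hypothetical names, and which matrices are rotated, on which side, and whether it is done per head are all assumptions left open here.

```latex
% Textbook orthogonal Procrustes; an assumption, not the paper's confirmed method.
\begin{aligned}
R^\star &= \arg\min_{R^\top R = I} \left\lVert W^{\text{base}} R - W^{\text{ft}} \right\rVert_F \\
(W^{\text{base}})^\top W^{\text{ft}} &= U \Sigma V^\top
\quad\Longrightarrow\quad R^\star = U V^\top \\
\widetilde{W} &= W^{\text{base}} R^\star
\end{aligned}
```

Under these assumptions, $\widetilde{W}$ is the base projection rotated as close as possible (in Frobenius norm) to the fine-tuned one, which is one plausible reading of "rotates them to best line up."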

The news. On June 9, 2026, LARK Lab released Attention Amnesia, a study of what happens to hybrid linear-attention LLMs when you fine-tune them on chain-of-thought data. The finding is blunt: fine-tuning biases attention gradients toward short-range patterns, and long-range retrieval falls off a cliff. On a 9B hybrid model, NIAH-S2@256K recall collapses from 67.2% to 9.4% after CoT fine-tuning. Their training-free QK-Restore lifts a fine-tuned 5B model's NIAH@256K recall from 65.4% to 76.4%, above where careful fine-tuning left it, while keeping the reasoning gains.

Picture a librarian who has spent years learning the whole building, deep back stacks included. Ask for an obscure title and she walks straight to the right shelf two hundred meters back. Then she crams for weeks on a rapid-fire front-desk quiz — common questions, fast answers. She aces the quiz. But now, asked for that obscure title, she reflexively checks the front desk, shrugs, and gives up. The book is still there; the route to it just faded. That fade is attention amnesia, and the quiz is chain-of-thought fine-tuning: the very training that made her sharper at quick reasoning is what dimmed her reach into the back of the library.

Inside the model, "the route to the back stacks" is concrete. A token decides what to pay attention to by comparing its query against every earlier token's key — and those queries and keys are produced by two weight matrices, W_Q and W_K. Fine-tuning nudges every weight to lower the training loss, and the paper's key observation is that on chain-of-thought data the cheapest way to lower that loss is to lean on nearby tokens. So the gradients quietly reshape W_Q and W_K to favor short range, and the full-attention layers that used to span a quarter-million tokens increasingly prefer what is close. The needle, sitting at depth 250,000, lands in a blind spot.
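
To make the query/key comparison concrete, here is a minimal pure-Python sketch of scaled dot-product attention weights for a single query. Everything in it is a toy assumption: the 2-d embeddings, the matrices `W_q_base` and `W_q_tuned`, and the idea that one feature stands in for "looks like a nearby token." Positional encoding is left out entirely, so this only shows that changing W_Q changes where the attention mass goes, not how the paper's models behave.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(W_q, W_k, query_emb, context_embs):
    """Scaled dot-product attention weights of one query over the context."""
    q = matvec(W_q, query_emb)
    keys = [matvec(W_k, emb) for emb in context_embs]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    return softmax(scores)

# Toy embeddings (hypothetical): feature 0 marks the distant needle,
# feature 1 marks three nearby tokens.
context = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
query = [1.0, 1.0]
W_k = [[1.0, 0.0], [0.0, 1.0]]

W_q_base = [[3.0, 0.0], [0.0, 1.0]]   # query emphasizes the needle's feature
W_q_tuned = [[1.0, 0.0], [0.0, 3.0]]  # query emphasizes the nearby tokens' feature

base_weights = attention_weights(W_q_base, W_k, query, context)
tuned_weights = attention_weights(W_q_tuned, W_k, query, context)

print(f"needle weight, base W_Q:  {base_weights[0]:.3f}")
print(f"needle weight, tuned W_Q: {tuned_weights[0]:.3f}")
```

The keys and the context are identical in both runs; only W_Q differs, yet the needle goes from receiving most of the attention to receiving very little.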

Here is the mechanism as arithmetic (illustrative). An attention head spreads a probability budget of 1.0 across all earlier positions. With long reach, a needle at position 250,000 might receive weight 0.04 — small, but enough to surface into the answer. After fine-tuning biases the head short, that same head dumps roughly 0.97 of its budget onto the last 2,000 tokens, leaving ~0.03 to spread across the other 254,000 positions. The needle's slice falls toward zero, and a fact that was retrievable becomes invisible. That is why the measured recall doesn't degrade gracefully — it collapses, 67.2% down to 9.4% on the 9B model the paper reports.
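
The illustrative numbers above can be checked in a few lines. This sketch adds one assumption the text does not make: that the leftover 0.03 is spread evenly over the distant positions, so the result is an average slice, and a real needle could get somewhat more or less.

```python
# Illustrative figures taken from the paragraph above.
CONTEXT_LEN = 256_000          # "@256K" context length
RECENT_WINDOW = 2_000          # tokens the short-biased head favors
MASS_ON_RECENT = 0.97          # share of the attention budget spent there
NEEDLE_WEIGHT_BEFORE = 0.04    # needle's weight with long reach intact

distant_positions = CONTEXT_LEN - RECENT_WINDOW
leftover_mass = 1.0 - MASS_ON_RECENT

# Assumption: the leftover mass is spread uniformly over distant positions.
avg_distant_weight = leftover_mass / distant_positions
shrink_factor = NEEDLE_WEIGHT_BEFORE / avg_distant_weight

print(f"distant positions:        {distant_positions:,}")
print(f"average weight per slot:  {avg_distant_weight:.2e}")
print(f"needle weight shrinks by: ~{shrink_factor:,.0f}x")
```

Under that even-spread assumption the needle's weight falls by more than five orders of magnitude, which is why the drop looks like a collapse rather than a gradual decline.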

The fix follows straight from the diagnosis. If the recall loss rides on the query and key projections, roll just those back. QK-Restore copies W_Q and W_K from the pre-fine-tuning checkpoint into the fine-tuned model and leaves everything else — the layers that actually learned to reason — untouched. No data, no gradient steps, no second training run: it is a weight swap, the equivalent of handing the librarian her old card catalog while she keeps every quiz answer she memorized. A Procrustes variant goes gentler, rotating the old Q/K to best align with the new ones so recall and reasoning don't tug against each other.
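
The weight swap described above can be sketched as a loop over two checkpoints held as name-to-weights dictionaries. This is a minimal illustration, not the paper's code: the function `qk_restore`, the parameter names such as `layers.0.attn.q_proj.weight`, and the `q_proj`/`k_proj` naming markers are all hypothetical, and nested lists stand in for real tensors. The sketch swaps every matching entry; whether the paper restricts the swap to the full-attention layers is not stated in this article.

```python
import copy

def qk_restore(base_state, tuned_state, qk_markers=("q_proj", "k_proj")):
    """Return (restored_state, swapped_names).

    restored_state is a copy of tuned_state in which every parameter whose
    name contains one of qk_markers is replaced by the base checkpoint's value.
    Neither input dictionary is modified.
    """
    restored = copy.deepcopy(tuned_state)
    swapped = []
    for name in tuned_state:
        if any(marker in name for marker in qk_markers):
            if name not in base_state:
                raise KeyError(f"base checkpoint has no parameter {name!r}")
            restored[name] = copy.deepcopy(base_state[name])
            swapped.append(name)
    return restored, swapped

# Toy checkpoints with hypothetical parameter names.
base_state = {
    "layers.0.attn.q_proj.weight": [[1.0, 0.0], [0.0, 1.0]],
    "layers.0.attn.k_proj.weight": [[1.0, 0.0], [0.0, 1.0]],
    "layers.0.attn.v_proj.weight": [[0.5, 0.5], [0.5, 0.5]],
    "layers.0.mlp.up_proj.weight": [[0.1, 0.2], [0.3, 0.4]],
}
tuned_state = {
    "layers.0.attn.q_proj.weight": [[0.2, 0.0], [0.0, 2.0]],
    "layers.0.attn.k_proj.weight": [[0.3, 0.0], [0.0, 1.5]],
    "layers.0.attn.v_proj.weight": [[0.6, 0.4], [0.4, 0.6]],
    "layers.0.mlp.up_proj.weight": [[0.9, 0.8], [0.7, 0.6]],
}

restored_state, swapped_names = qk_restore(base_state, tuned_state)
print("rolled back:", sorted(swapped_names))
```

With a real framework the same loop would run over the model's parameter dictionary (in PyTorch, the output of `state_dict()`), and the result would be loaded back into the fine-tuned model. No optimizer or data loader appears anywhere, which is the sense in which the method is training-free.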

State of the model	Long-range recall (NIAH@256K)	Reasoning gains	Weights changed
Before CoT fine-tuning	strong — ~67.2% (9B, reported)	baseline	—
After CoT fine-tuning	collapses — ~9.4% (9B, reported)	gained	all weights, incl. W_Q/W_K
After QK-Restore	restored — 65.4% → 76.4% (5B, reported)	kept	only W_Q/W_K rolled back

Why does such a blunt move work — and even beat the original? Because the recall loss could be reverted on its own. Fine-tuning improved the reasoning machinery while shifting the addressing machinery — the Q/K projections — and rolling that one part back leaves the reasoning gains in place. On the 5B model the paper reports, recall climbs from the fine-tuned 65.4% to 76.4%, past where the careful fine-tune left it, because the restored Q/K reach is now paired with reasoning layers that the base model never had. The lesson generalizes past this one paper: when fine-tuning silently costs a capability, the cure may not be more training but undoing the few weights that moved too far — a reminder that where a token looks is a separate, restorable part of the network from what it knows.


Related explainers

RoPE provably fails at long context — position and token discrimination limits — a different way long-range attention breaks: a hard ceiling baked into positional encoding, not a fixable fine-tuning side effect
Gated DeltaNet-2 — decoupled erase/write gates — a closer look at the linear-attention half of the hybrid models this paper studies
Parallax — local-linear attention vs FlashAttention — another angle on trading full attention's reach for linear-attention's cheap memory


