What is Logit-Contribution Scoring?

Logit-Contribution Scoring (LOCOS, arXiv 2607.01002) is a way to find a language model's retrieval heads by measuring what each attention head writes rather than where it looks. For a given answer token, it projects each head's OV-circuit output onto that token's unembedding direction and ranks heads by how much they push the answer's logit up. Unlike attention-overlap detectors, it catches heads that retrieve a fact by meaning without copying the exact token.

Why do attention-based retrieval-head detectors miss heads?

They score a head by whether the token it attends to matches the token the model generates — a literal-copy test. A head that reads a passage and writes the answer in different words never lines up its attention with the generated token, so a location check gives it no credit. LOCOS measures the head's contribution to the answer's logit instead, so non-literal, meaning-based retrieval heads become visible.

How does LOCOS prove the heads it finds actually matter?

By ablation. On Qwen3-8B, ablating the top 50 LOCOS-scored heads drops long-context recall from a ROUGE-L of 0.401 to 0.000, with matching collapses on MuSiQue (0.55 to 0.08) and BABILong (0.62 to 0.20). If turning off a head class flatlines the model's ability to surface a buried fact, those heads were causally responsible for the retrieval — not the ones an attention-overlap score would have flagged.

LOCOS finds non-literal retrieval heads by scoring logit contribution — Logit-Contribution Scoring

Jargon

Retrieval head: An attention head whose job is to fetch a specific fact from far back in the context and carry it forward to where the answer is written. Long-context recall leans on a small number of them.
Non-literal retrieval: Answering from the meaning of a passage rather than copying its exact tokens. The model reads "the vault opens on the word marigold" and later writes "the code word was marigold" — same fact, different words.
OV circuit: The "output-value" path of an attention head: the part that decides what a head writes into the residual stream once it has chosen where to look. Its counterpart, the QK circuit, decides where the head reads.
Unembedding matrix: The model's output layer (LM head): it maps the final hidden state to one logit per vocabulary token. The row for a given token is that token's unembedding direction.
Logit: The raw score a model assigns each token before the softmax. A higher logit for the answer token means the model is more confident it comes next.
Ablation: Switching a component off to test if it mattered. LOCOS ablates its top-scored heads; if long-context recall collapses, those heads were causally doing the retrieval.
ROUGE-L: An overlap score between the model's answer and the correct one, based on their longest common subsequence, from 0 to 1. Here it is a stand-in for "did the model still surface the buried fact?"
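To make the ROUGE-L entry concrete, here is a minimal sketch of a longest-common-subsequence overlap score. It assumes the common F-score form on lowercase whitespace tokens; which variant and tokenization the paper uses is not stated here, and `lcs_length` and `rouge_l` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-score on lowercase whitespace tokens (assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# The glossary's own example: same fact, different words.
score = rouge_l("the vault opens on the word marigold",
                "the code word was marigold")
```

A paraphrased answer can still score well above zero because the shared words appear in the same order, which is why a fall to 0.000 signals that the fact is gone rather than merely reworded.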

The news. On July 1, 2026, a paper titled "LOCOS: Locating Non-Literal Retrieval Heads by Logit Contribution" (arXiv 2607.01002) argued that the standard way of finding a model's retrieval heads has a blind spot. Existing detectors reward a head when the token it attends to matches the token the model generates: a literal-copy test. LOCOS instead scores each head by how much its output pushes the answer token's logit up, catching heads that retrieve by meaning without copying. Ablating LOCOS's top-ranked heads on Qwen3-8B drops long-context recall to near zero, evidence they are the ones doing the work.

Picture a vast library and a question whose answer sits on exactly one shelf, buried among thousands. Send in a crowd of clerks — the model's attention heads — and later ask who actually found it. The old rule credits whichever clerk is still standing at the exact answer shelf: easy to check, because you just match their location against the shelf that held the fact. This is the literal-copy test, and for a head that copies the needle token verbatim it works fine.

But the clerk who read the shelf, walked off, and wrote the answer in their own words gets no credit — they aren't standing where the fact lives anymore. That is a non-literal retrieval head: it answers from the meaning of a passage, so its attention no longer points at the token you would check against. As the paper puts it, the old detectors "reward heads whose attended token matches the generated token, a literal-copy criterion that captures where a head reads but not what it writes." A whole class of retrieval heads is invisible to a location check.
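The literal-copy test described above can be sketched in a few lines. This is a simplified toy version, not the exact procedure of any published detector: it assumes we already have each head's attention weights over the context at every decoding step, and the function name `attention_overlap_score` and the array layout are hypothetical.

```python
import numpy as np


def attention_overlap_score(attn, context_ids, generated_ids):
    """Per-head fraction of decoding steps that pass the literal-copy test.

    attn:          [steps, heads, context_len] attention weights
    context_ids:   [context_len] token ids of the context
    generated_ids: [steps] token id generated at each step
    """
    attn = np.asarray(attn)
    context_ids = np.asarray(context_ids)
    generated_ids = np.asarray(generated_ids)
    top_pos = attn.argmax(axis=-1)        # most-attended position, [steps, heads]
    attended_ids = context_ids[top_pos]   # token id at that position
    hits = attended_ids == generated_ids[:, None]
    return hits.mean(axis=0)              # [heads]


# Toy example: head 0 points at the token being copied, head 1 never does.
context_ids = [10, 11, 12, 13]
generated_ids = [12, 13]
attn = [
    [[0.1, 0.1, 0.7, 0.1], [0.7, 0.1, 0.1, 0.1]],  # step 0
    [[0.1, 0.1, 0.1, 0.7], [0.7, 0.1, 0.1, 0.1]],  # step 1
]
overlap = attention_overlap_score(attn, context_ids, generated_ids)
```

Head 1 scores zero here no matter what it writes into the residual stream, which is exactly the blind spot: the check only looks at `argmax` positions and never at the head's output.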

LOCOS changes what gets measured: not where a clerk stands, but how much of their note ends up on the answer slip. Concretely, it takes each head's OV-circuit output (what the head actually writes into the residual stream) and projects it onto the unembedding direction of the answer token, the same direction the LM head uses to turn a hidden state into that token's logit. A head that shoves the answer token's score upward gets a high logit contribution, whether or not it ever pointed at the needle. LOCOS contrasts the needle position against off-needle positions in a single forward pass, then ranks every head by its write-aware contribution.
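A rough sketch of that projection, under stated assumptions, might look like the following. It is not the paper's implementation: it ignores the final layer norm and biases, works on one layer at one answer-writing position, and takes the needle contrast to be "contribution routed from needle positions minus contribution from all other positions", which is only one plausible reading. The name `needle_contrast_scores` and the tensor layouts are hypothetical.

```python
import numpy as np


def needle_contrast_scores(attn, values, w_o, unembed_dir, needle_mask):
    """Per-head logit contribution toward the answer token, needle vs. rest.

    attn:        [heads, src_len]         attention from the answer position
    values:      [heads, src_len, d_head] per-head value vectors
    w_o:         [heads, d_head, d_model] per-head output projections
    unembed_dir: [d_model]                answer token's unembedding direction
    needle_mask: [src_len] bool           True at needle positions
    """
    needle_mask = np.asarray(needle_mask, dtype=bool)
    # Fold the output projection into the answer direction: [heads, d_head].
    ov_to_logit = np.einsum("hdm,m->hd", w_o, unembed_dir)
    # Logit pushed through each source position by each head: [heads, src_len].
    per_pos = attn * np.einsum("hsd,hd->hs", values, ov_to_logit)
    needle = per_pos[:, needle_mask].sum(axis=1)
    off_needle = per_pos[:, ~needle_mask].sum(axis=1)
    return needle - off_needle  # assumed form of the contrast


# Toy example: both heads attend to the needle (position 1) equally,
# but only head 0 writes along the answer direction.
attn = np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]])
values = np.array([[[1.0, 0.0]] * 3, [[0.0, 1.0]] * 3])
w_o = np.stack([np.eye(2), np.eye(2)])
unembed_dir = np.array([1.0, 0.0])
needle_mask = [False, True, False]

scores = needle_contrast_scores(attn, values, w_o, unembed_dir, needle_mask)
ranking = np.argsort(-scores)  # heads ordered by contribution, highest first
```

The two toy heads have identical attention patterns, so a location check cannot tell them apart; only the projection onto `unembed_dir` separates the head that writes the answer from the one that does not.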

Now put numbers on it. Take Qwen3-8B, score all its heads by logit contribution in one forward pass, and ablate the top 50: long-context answering doesn't just dip, it flatlines, with ROUGE-L falling from 0.401 to 0.000. Scores also drop sharply across task types on the same model: multi-hop MuSiQue falls from 0.55 to 0.08, and BABILong from 0.62 to 0.20. Knock out 50 write-aware heads and a model that could read a novel-length prompt suddenly can't recall the one line that mattered. That is causal evidence that those heads were doing the retrieval, including heads a location check would not have flagged. The paper reports the same pattern across Qwen3, Gemma-3, and OLMo-3.1.
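The ablation step can be sketched as: pick the k highest-scoring (layer, head) pairs, then silence those heads before their output is projected back into the residual stream. The sketch below assumes zero-ablation; whether the paper zeroes heads or replaces them with a mean activation is not stated here. `top_k_heads` and `ablate_heads` are hypothetical helpers operating on plain arrays rather than a real model.

```python
import numpy as np


def top_k_heads(scores, k):
    """Return the k highest-scoring (layer, head) pairs from [layers, heads]."""
    scores = np.asarray(scores)
    flat = np.argsort(scores, axis=None)[::-1][:k]  # flat indices, best first
    return [
        tuple(int(i) for i in np.unravel_index(idx, scores.shape))
        for idx in flat
    ]


def ablate_heads(head_outputs, layer, heads_to_ablate):
    """Zero the selected heads in one layer.

    head_outputs:    [heads, seq_len, d_head] for the given layer,
                     taken before the output projection
    heads_to_ablate: iterable of (layer, head) pairs
    """
    out = np.array(head_outputs, dtype=float, copy=True)
    for l, h in heads_to_ablate:
        if l == layer:
            out[h] = 0.0  # assumed zero-ablation
    return out


# Toy example: 2 layers x 2 heads, ablate the top 2 heads.
scores = [[0.1, 0.9], [0.5, 0.2]]
chosen = top_k_heads(scores, k=2)
layer0 = ablate_heads(np.ones((2, 3, 4)), layer=0, heads_to_ablate=chosen)
```

In a real experiment the same masking would be applied inside the model's forward pass (for example through a hook on each attention layer), and the benchmark would be rerun with and without it to compare scores.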

Way to find retrieval heads	What it measures	Catches literal-copy heads?	Catches meaning-based heads?
Attention-overlap (retrieval score)	does the head's attended token equal the generated token?	yes	no — misses them
LOCOS (arXiv 2607.01002)	how much the head's OV output moves the answer token's logit	yes	yes

What makes this clean is that the measuring stick was already inside the model. You do not train anything extra: the answer token's own unembedding direction tells you which write to look for, and each head's OV output tells you who is doing that writing. LOCOS just takes the dot product and ranks. The lesson generalizes past this one paper — when you want to know which part of a network is responsible for an output, follow what it writes toward that output, not where it happens to be looking.


Related explainers

EmbedFilter — Unembedding matrix as a feature lens — another result that reads a diagnostic straight off the unembedding matrix
BlockSearch — Attention dilution — why long-context retrieval collapses in the first place, and what it looks like from the attention side
HydraHead — Per-head attention hybridization — once you know which heads are retrieval-critical, you can spend the expensive attention only on them


