What is attention dilution?

Attention dilution is the failure BlockSearch (arXiv 2607.01538, July 2026) blames for long-context retrieval collapse. Attention's softmax splits one fixed budget of attention across every token by dividing each token's exponentiated score by the sum over all tokens — the denominator. As you paste in more and more irrelevant documents, they swell that denominator, so the normalized attention weight on the gold document shrinks even though its raw pre-softmax score is unchanged. The right document is still there; attention just can no longer afford to look at it.

Why does long-context retrieval collapse as the corpus grows?

Because the softmax normalizes by the total over all tokens. Hold the gold document's raw score fixed and grow the number of competing tokens, and its share of attention falls purely arithmetically — from, say, 20% in a small context to 1% at a million tokens, with no change to the document itself. That is why the fix BlockSearch proposes is a length-aware softmax denominator plus document-level sparse attention, not a better query or a bigger model.

How does BlockSearch relate to dense vector retrieval?

Dense retrieval embeds the query and documents as vectors and ranks them with a separate nearest-neighbor index, using one fixed notion of similarity. BlockSearch keeps retrieval inside the model: a 0.6B model conditions on the in-context corpus and generates the answer directly, with a length-aware softmax so it does not dilute. The paper reports it matching dense retrieval on MS MARCO and NQ at a million tokens while being about 7× smaller than the concurrent MSA model, and scoring about 3× higher on LIMIT, a task that needs a different notion of similarity than a fixed embedding gives.
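To make "one fixed notion of similarity" concrete, here is a minimal sketch of the dense-retrieval side of that comparison. Everything in it is an assumption for illustration: the three-dimensional vectors are hand-written stand-ins for encoder outputs, the document ids are hypothetical, and the brute-force scan stands in for the nearest-neighbor index a real system would use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by cosine similarity to the query, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical embeddings; a real pipeline would get these from an encoder model.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.9, 0.2]

print(rank_documents(query_vec, doc_vecs))
```

The ranking depends only on the similarity function chosen up front (cosine here), which is the fixed notion of similarity that a task like LIMIT is said to stress.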

BlockSearch traces million-token retrieval collapse — Attention dilution

TL;DR

What is it: The BlockSearch paper (arXiv 2607.01538) is the first systematic study of in-context retrieval at the million-token scale, and it traces long-context retrieval collapse to one mechanism: attention dilution — the softmax denominator quietly drowning the document you need.
Why it’s needed: Pasting a huge corpus into the context and asking the model to find the right document is how many agents ground answers in evidence. If attention loses the gold document as the context grows, the whole retrieve-then-generate step breaks exactly when you scale it up.
Vs. previous: Unlike dense (vector) retrieval, which uses a separate encoder plus an approximate-nearest-neighbor index with one fixed notion of similarity, BlockSearch keeps retrieval inside the model's own attention and repairs the softmax so it stops diluting, matching dense retrieval at 1 million tokens with a 0.6B model.

Jargon

In-context retrieval: Finding the right document by pasting the whole corpus into the model's context and letting attention pick it out, instead of running a separate search index. BlockSearch is a study of how well this holds up at a million tokens.
Softmax: The step that turns raw attention scores into weights that sum to 1 — one fixed budget of attention split across every token. Each token's share is its exponentiated score divided by the sum over all tokens.
Softmax denominator: The sum of every token's exponentiated score — the "total" each token's share is divided by. It is the part that grows as you add more tokens, and the direct cause of dilution.
Attention dilution: The paper's name for the failure: as the corpus grows, irrelevant documents swell the denominator, so the normalized weight on the gold document shrinks even when its raw score is unchanged.
Gold document: The one document that actually answers the query — the needle the model is supposed to attend to in the haystack of the pasted corpus.
Dense (vector) retrieval: The standard alternative: embed the query and every document as vectors, then rank by similarity with a nearest-neighbor index. BlockSearch matches it at a million tokens while staying inside the model.
Length generalization: Working far beyond the sequence lengths seen in training. BlockSearch reports holding up at roughly 10× its training-time corpus size, where naive attention would already have collapsed.
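
The softmax, denominator, and dilution entries above can be written compactly. The symbols are notation chosen here, not taken from the paper: s_i is the raw score of token i, w_i its attention weight, n the number of tokens in context, and g the index of the gold document's token.

```latex
\[
w_i = \frac{e^{s_i}}{\sum_{j=1}^{n} e^{s_j}},
\qquad
\sum_{i=1}^{n} w_i = 1
\]
\[
w_g = \frac{e^{s_g}}{e^{s_g} + \sum_{j \neq g} e^{s_j}}
\]
```

Every term in the second sum is positive, so each added token makes the denominator larger. With s_g held fixed, w_g can only shrink as n grows.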

The news. On July 1, 2026, researchers released BlockSearch, a 0.6B-parameter language-model retriever that conditions on an in-context corpus and generates the answer directly, as an alternative to vector search. The paper is the first systematic study of in-context retrieval at the million-token scale, and it traces retrieval collapse to attention dilution: as the corpus grows, irrelevant documents dominate the softmax denominator, so the normalized attention mass on the gold document shrinks even when its pre-softmax score stays high. BlockSearch adds length-aware softmax adjustments and document-level sparse attention, matching dense retrieval on MS MARCO and NQ at 1 million tokens while being 7× smaller than the concurrent MSA model, and scoring 3× higher on LIMIT, a task that needs a different notion of similarity.

Picture one spotlight over a stage, and it has a fixed amount of light to give. The softmax works the same way: it hands out one fixed budget of attention, split across every token by how high each token already scores. When only your lead actor is on stage, almost all the light lands on them. Now let the stage fill with extras — a growing crowd of documents you pasted in but do not care about. The star has not dimmed one bit. But the same fixed light is now shared among far more bodies, so the star's slice of it shrinks. That thinning slice is exactly what BlockSearch calls attention dilution.

Here is the part that makes it a clean, teachable failure of the attention softmax rather than a modeling mistake. The gold document's raw score — its pre-softmax match against the query — never changed; only the crowd around it grew. Softmax normalizes by dividing each token's exponentiated score by the sum over all tokens, and that sum is the denominator. Pour thousands of irrelevant documents into the context and each one adds a little to the denominator. The numerator for the gold document holds steady while the denominator balloons, so the weight that actually flows to it drains toward zero. The model still "sees" the right document; it just can no longer afford to look at it.
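
The mechanism in the paragraph above is easy to reproduce in a few lines. This is a toy sketch, not the paper's experiment: it assumes a single gold token with a fixed raw score of 3.0 and a crowd of identical distractor tokens with raw score 0.0, both of which are made-up values.

```python
import math

def gold_attention_weight(gold_logit, num_distractors, distractor_logit=0.0):
    """Softmax weight on one gold token competing with identical distractors."""
    gold = math.exp(gold_logit)                           # numerator: never changes
    crowd = num_distractors * math.exp(distractor_logit)  # grows with the corpus
    return gold / (gold + crowd)

# Hold the gold score fixed and only grow the crowd around it.
for n in (10, 100, 10_000, 1_000_000):
    weight = gold_attention_weight(3.0, n)
    print(f"{n:>9} distractors -> gold weight {weight:.6f}")
```

The gold logit is the same on every row of the output; only the distractor count changes, and the weight falls with it.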

So how do you get the light back on the star? BlockSearch does two things, and both are about the crowd, not the star. First, it makes the softmax length-aware — it adjusts the normalization so a longer corpus does not automatically starve every token — and second, it uses document-level sparse attention so the full pile no longer floods one shared softmax all at once. In the stage metaphor: re-aim the spotlight so the extras stop stealing its budget, and clear most of them off the stage before the light is split at all. Together these let a 0.6B retriever hold its focus where naive attention would have lost it, and length-generalize to roughly 10× the corpus size it was trained on.
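
This explainer does not give the paper's formulas, so the sketch below uses stand-ins for both ideas and should not be read as BlockSearch's implementation. For length-awareness it assumes a simple rule that sharpens the scores by log(n) / log(ref_len), where ref_len is a hypothetical reference context length. For document-level sparsity it assumes a top-k selection of documents by their best token score. Function names and all numbers are illustrative.

```python
import math

def softmax(logits):
    """Plain softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def length_scaled_softmax(logits, ref_len):
    """Assumed length-aware variant: scale scores up as the context outgrows ref_len."""
    scale = max(1.0, math.log(len(logits)) / math.log(ref_len))
    return softmax([x * scale for x in logits])

def block_sparse_softmax(doc_logits, top_k):
    """Assumed document-level sparsity: keep the top_k documents by max score,
    then run softmax only over the tokens of the kept documents."""
    ranked = sorted(range(len(doc_logits)),
                    key=lambda i: max(doc_logits[i]), reverse=True)
    kept = sorted(ranked[:top_k])
    weights = softmax([x for i in kept for x in doc_logits[i]])
    out, pos = {}, 0
    for i in kept:
        out[i] = weights[pos:pos + len(doc_logits[i])]
        pos += len(doc_logits[i])
    return out

# Toy setup: one gold token (score 3.0) among distractors (score 0.0).
gold_logit, n_tokens = 3.0, 100_000
logits = [gold_logit] + [0.0] * n_tokens
plain_gold = softmax(logits)[0]
scaled_gold = length_scaled_softmax(logits, ref_len=80)[0]

# 200 documents of 50 tokens each; the gold token sits in document 137.
docs = [[0.0] * 50 for _ in range(200)]
docs[137][10] = gold_logit
dense_gold = softmax([x for d in docs for x in d])[137 * 50 + 10]
sparse_gold = block_sparse_softmax(docs, top_k=4)[137][10]

print(f"plain vs length-scaled: {plain_gold:.6f} vs {scaled_gold:.6f}")
print(f"dense vs block-sparse:  {dense_gold:.6f} vs {sparse_gold:.6f}")
```

In both comparisons the gold score itself is untouched. The first change alters how the normalization responds to length, and the second shrinks the crowd that enters the softmax, which matches the "fix the crowd, not the star" framing above.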

Watch the denominator do the damage with one concrete pass. Give the gold document a raw exponentiated weight of 20 and hold it fixed. In a small context of 80 other tokens, each contributing about 1, the denominator is 20 + 80 = 100, so the gold document's share of attention is 20 / 100 = 20%. Now scale the corpus to a million tokens and suppose the stray weight totals roughly 1,980 units (far less than 1 per token, so if anything this flatters the long context): the denominator becomes 20 + 1,980 = 2,000, and the share collapses to 20 / 2,000 = 1%. These numbers are illustrative; the paper reports the effect, not these exact per-token values. The score in the numerator never moved; a 20× larger denominator did all of the work. That is why a length-aware denominator, not a smarter query, is the fix.
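
The same pass as a few lines of arithmetic, using the illustrative weights from the paragraph above (20 for the gold document, 80 and 1,980 for the stray weight). The function name is a label chosen here.

```python
def attention_share(gold_weight, stray_weight):
    """Gold document's share: its exponentiated weight over the full denominator."""
    return gold_weight / (gold_weight + stray_weight)

small = attention_share(20.0, 80.0)
large = attention_share(20.0, 1980.0)
growth = (20.0 + 1980.0) / (20.0 + 80.0)

print(f"small context: {small:.0%}")               # small context: 20%
print(f"large context: {large:.0%}")               # large context: 1%
print(f"denominator grew by: {growth:.0f}x")       # denominator grew by: 20x
```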

Approach	How it finds the document	Where it struggles, or what it reports
Dense / vector retrieval	Embed query + docs, rank by similarity in a separate ANN index	One fixed notion of similarity; a task like LIMIT needs a different one
In-context retrieval (naive)	Paste the corpus, let attention pick the document out directly	Collapses at long context via attention dilution
BlockSearch (0.6B)	In-context, but with a length-aware softmax + document-level sparse attention [paper]	Reported to match dense at 1M tokens, ~3× higher on LIMIT, ~7× smaller than MSA [paper]

Related explainers

RoPE provably fails at long context — the other long-context failure mode: position encoding losing its power to discriminate, where attention dilution is about the softmax denominator, not position.
Is Grep All You Need? — grep vs vector retrieval — the same "search inside the model vs a separate index" debate that BlockSearch's in-context retriever pushes on.
SubQ 1.1 — subquadratic sparse attention — the toolbox BlockSearch reaches into: sparse attention as the way to keep million-token context affordable.

Frequently Asked Questions
