The news. On July 1, 2026, researchers released BlockSearch, a 0.6B-parameter language-model retriever that conditions on an in-context corpus and generates the answer directly, as an alternative to vector search. It is the first systematic study of in-context retrieval at the million-token scale, and it traces retrieval collapse to attention dilution: as the corpus grows, irrelevant documents dominate the softmax denominator, so the normalized attention mass on the gold document shrinks even when its pre-softmax score stays high. BlockSearch adds length-aware softmax adjustments and document-level sparse attention. With them it matches dense retrieval on MS MARCO and NQ at 1 million tokens while being 7× smaller than the concurrent MSA model, and it scores 3× higher on LIMIT, a task that needs a different notion of similarity.
Picture one spotlight over a stage, and it has a fixed amount of light to give. The softmax works the same way: it hands out one fixed budget of attention, split across every token by how high each token already scores. When only your lead actor is on stage, almost all the light lands on them. Now let the stage fill with extras — a growing crowd of documents you pasted in but do not care about. The star has not dimmed one bit. But the same fixed light is now shared among far more bodies, so the star's slice of it shrinks. That thinning slice is exactly what BlockSearch calls attention dilution.
Here is the part that makes it a clean, teachable failure of the attention softmax rather than a modeling mistake. The gold document's raw score — its pre-softmax match against the query — never changed; only the crowd around it grew. Softmax normalizes by dividing each token's exponentiated score by the sum over all tokens, and that sum is the denominator. Pour thousands of irrelevant documents into the context and each one adds a little to the denominator. The numerator for the gold document holds steady while the denominator balloons, so the weight that actually flows to it drains toward zero. The model still "sees" the right document; it just can no longer afford to look at it.
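The dilution described above can be reproduced in a few lines. This is a minimal sketch, not code from the paper: the scores `GOLD_SCORE` and `DISTRACTOR_SCORE` and the helper `gold_share` are hypothetical, chosen so that the gold token's exponentiated weight is about 20 and each distractor's is exactly 1.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw (pre-softmax) scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)  # the denominator that every token shares
    return [e / total for e in exps]

def gold_share(gold_score, distractor_score, n_distractors):
    """Attention weight on one gold token among n identical distractor tokens."""
    weights = softmax([gold_score] + [distractor_score] * n_distractors)
    return weights[0]

GOLD_SCORE = 3.0        # hypothetical raw score; exp(3.0) is about 20
DISTRACTOR_SCORE = 0.0  # hypothetical raw score; exp(0.0) is exactly 1

# The gold score is identical in every row; only the crowd grows.
for n in (0, 80, 1_980, 100_000):
    share = gold_share(GOLD_SCORE, DISTRACTOR_SCORE, n)
    print(f"{n:>7} distractors -> gold share {share:.4%}")
```

The printed share falls as `n` rises even though `GOLD_SCORE` is passed in unchanged, which is the whole failure in miniature.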
So how do you get the light back on the star? BlockSearch does two things, and both are about the crowd, not the star. First, it makes the softmax length-aware — it adjusts the normalization so a longer corpus does not automatically starve every token — and second, it uses document-level sparse attention so the full pile no longer floods one shared softmax all at once. In the stage metaphor: re-aim the spotlight so the extras stop stealing its budget, and clear most of them off the stage before the light is split at all. Together these let a 0.6B retriever hold its focus where naive attention would have lost it, and length-generalize to roughly 10× the corpus size it was trained on.
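To make the two ideas concrete, here is a rough sketch under loud assumptions. The text above names the fixes but not their formulas, so neither function below should be read as BlockSearch's actual method. `length_aware_softmax` uses log-length logit scaling, one adjustment known from the literature, and `document_sparse_attention` keeps the top-k documents ranked by their best token score, which is an assumed selection rule. All names, scores, and the `base_len` parameter are hypothetical.

```python
import math

def softmax(scores, scale=1.0):
    """Stable softmax over raw scores; scale > 1 sharpens the distribution."""
    peak = max(scores)
    exps = [math.exp(scale * (s - peak)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def length_aware_softmax(scores, base_len=64):
    """Assumed adjustment: scale logits by log(n) / log(base_len) once the
    context exceeds base_len tokens, so score gaps widen as length grows."""
    n = len(scores)
    scale = max(1.0, math.log(n) / math.log(base_len))
    return softmax(scores, scale)

def document_sparse_attention(doc_scores, top_k=2):
    """doc_scores is a list of per-document token-score lists.
    Returns {doc_index: token_weights} for the kept documents only."""
    # Rank documents by their best token score (an assumed block summary).
    ranked = sorted(range(len(doc_scores)),
                    key=lambda i: max(doc_scores[i]), reverse=True)
    kept = sorted(ranked[:top_k])
    # One softmax, but only over tokens of the documents that survived.
    flat = [s for i in kept for s in doc_scores[i]]
    weights = softmax(flat)
    out, pos = {}, 0
    for i in kept:
        out[i] = weights[pos:pos + len(doc_scores[i])]
        pos += len(doc_scores[i])
    return out

# Hypothetical corpus: 500 four-token distractor documents plus one gold document.
GOLD_DOC = 250
docs = [[0.0] * 4 for _ in range(501)]
docs[GOLD_DOC][0] = 3.0  # the single gold token; exp(3.0) is about 20

flat_scores = [s for doc in docs for s in doc]
gold_pos = GOLD_DOC * 4

dense_share = softmax(flat_scores)[gold_pos]
length_aware_share = length_aware_softmax(flat_scores)[gold_pos]
sparse_share = document_sparse_attention(docs, top_k=2)[GOLD_DOC][0]

print(f"plain softmax over all tokens: {dense_share:.2%}")
print(f"length-aware softmax:          {length_aware_share:.2%}")
print(f"top-2 documents only:          {sparse_share:.2%}")
```

In this toy setup both changes raise the gold token's share without touching its raw score, which matches the point that the fixes act on the crowd rather than the star. Log-length scaling only slows the dilution here, while dropping documents before the softmax removes most of the denominator outright.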
Watch the denominator do the damage with one concrete pass. Give the gold document a raw exponentiated weight of 20 and hold it fixed. In a small context of 80 other tokens, each contributing about 1, the denominator is 20 + 80 = 100, so the gold document's share of attention is 20 / 100 = 20%. Now grow the context to about 1,980 such tokens, so the gold document competes with roughly 1,980 units of stray weight: the denominator becomes 20 + 1,980 = 2,000, and its share collapses to 20 / 2,000 = 1%. At a full million tokens of that same weight, the share would be about 0.002% (illustrative numbers; the paper reports the effect, not these exact per-token values). The score in the numerator never moved; a 20× larger denominator took the share from 20% to 1%. That is why a length-aware denominator, not a smarter query, is the fix.
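The arithmetic above fits a single formula. The symbols are notation introduced here for illustration, not the paper's: g is the gold document's exponentiated weight, w̄ is the average exponentiated weight of a stray token, n is the number of stray tokens, and s is a target share. The second expression inverts the first to give the number of stray tokens at which the share falls to s.

```latex
\[
\text{share}(n) = \frac{g}{g + n\,\bar{w}},
\qquad
n(s) = \frac{g}{\bar{w}}\left(\frac{1}{s} - 1\right)
\]
\[
\text{share}(80) = \frac{20}{20 + 80} = 0.20,
\qquad
\text{share}(1980) = \frac{20}{20 + 1980} = 0.01
\]
\[
n(0.20) = 20\,(5 - 1) = 80,
\qquad
n(0.01) = 20\,(100 - 1) = 1980
\]
```

With g and w̄ held fixed, the share depends only on n, so it can only be restored by changing how the denominator is formed.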
| Approach | How it finds the document | Limitation or reported result |
|---|---|---|
| Dense / vector retrieval | Embeds query and docs, ranks by similarity in a separate ANN index | One fixed notion of similarity; a task like LIMIT needs a different one |
| In-context retrieval (naive) | Pastes the corpus, lets attention pick the document out directly | Collapses at long context via attention dilution |
| BlockSearch (0.6B) | In-context, but with a length-aware softmax + document-level sparse attention [paper] | Matches dense at 1M tokens, ~3× higher on LIMIT, ~7× smaller than MSA [paper] |
Related explainers
- RoPE provably fails at long context — the other long-context failure mode: position encoding losing its power to discriminate, where attention dilution is about the softmax denominator, not position.
- Is Grep All You Need? — grep vs vector retrieval — the same "search inside the model vs a separate index" debate that BlockSearch's in-context retriever pushes on.
- SubQ 1.1 — subquadratic sparse attention — the toolbox BlockSearch reaches into: sparse attention as the way to keep million-token context affordable.