Is Grep All You Need? — Grep vs vector retrieval for agentic search
The news. On May 14, 2026, the "Is Grep All You Need?" paper (arXiv:2605.15184) ran two empirical experiments on agentic search. The first wired both a literal grep tool and a vector-retrieval tool into the same agents across multiple platforms and ran 116 LongMemEval questions through each; across the conditions tested, grep generally yielded higher accuracy than vector retrieval. The second progressively injected irrelevant context to probe robustness. The authors’ headline framing: overall performance is dominated by the agent harness and tool-calling style, not by the retrieval algorithm alone. Read the paper →
Picture the Ctrl-F vs librarian split again. When the user asks "What did the report say about the Paris office in Q3?" the Ctrl-F path is brutally literal — type the string Paris, get every line that contains it, scan the surrounding sentences. The librarian path is semantic — describe the question, the librarian walks the shelves and hands you the three books that "feel close." When the shelf is clean and the question is fuzzy ("anything about European expansion"), the librarian wins. When the shelf has been padded with unrelated chapters and the question hinges on a literal token, Ctrl-F just works.
The animation above runs both tools against the same 32-chunk document store. At low noise, vector retrieval pulls the right 3 chunks — the embeddings line up. As irrelevant context grows (the noise dial on the left fills up), distractor chunks start clustering near the query in vector space, and the top-k starts mixing in chunks marked pink — the wrong ones. Grep keeps emitting the same 3 literal matches regardless of how much unrelated text was added, because literal substring matching is immune to the kind of distributional drift that nudges nearest neighbours around.
The paper’s more interesting claim sits underneath the headline. When the authors swap the agent harness — different platforms, different tool-call shapes, different stop conditions — accuracy moves more than when they swap retrieval algorithms. A well-designed harness running vector retrieval can beat a poorly designed harness running grep. The retrieval algorithm is one knob among many, and it’s usually not the dominant one. That’s the line worth highlighting: this is not "embeddings are dead," it’s "your harness design and tool-loading strategy deserve more attention than your vector DB does."
Why irrelevant context hurts vector retrieval more
Vector retrieval works by distance in embedding space. Add irrelevant text, and three things happen at once. The query embedding drifts — the model is embedding a longer prompt that now mentions topics the user didn’t ask about. The document distribution shifts — distractor chunks crowd the latent space near the relevant ones. And the top-k cutoff stays fixed at, say, k = 5; if two distractors now sit closer than two relevant chunks, the relevant ones get evicted. The paper’s noise sweep makes this concrete, and the animation mirrors it with illustrative numbers: as noise rises, vector accuracy drops from roughly 71% to 35% while grep stays flat near 74%. Grep simply doesn’t care how much surrounding text exists; it matches what was typed and returns exactly those spans. The fragility isn’t in embeddings as a technology — it’s in the failure mode where the algorithm assumes a clean shelf and the harness keeps shoving random books onto it.
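The eviction mechanism is easy to see in a toy simulation. The sketch below is a minimal illustration, not the paper's setup: random unit vectors stand in for real embeddings, and the spreads and noise counts are arbitrary choices, so the printed numbers are synthetic. Grep isn't simulated because a literal match set is unchanged by added text by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 64, 5  # toy embedding size and top-k cutoff, both arbitrary

def embed_near(anchor, spread):
    """A synthetic 'embedding' at a controlled distance from an anchor."""
    v = anchor + spread * rng.normal(size=DIM)
    return v / np.linalg.norm(v)

query = rng.normal(size=DIM)
query /= np.linalg.norm(query)

# Three relevant chunks sit close to the query (indices 0-2).
relevant = [embed_near(query, 0.3) for _ in range(3)]

for n_noise in (0, 20, 200, 2000):
    # Distractors are drawn near the query too: topical-looking but wrong,
    # which is exactly what crowds the fixed top-k.
    docs = relevant + [embed_near(query, 0.5) for _ in range(n_noise)]
    sims = np.array([d @ query for d in docs])  # cosine, since all unit vectors
    topk = set(np.argsort(-sims)[:K])
    hit = len(topk & {0, 1, 2})
    print(f"noise={n_noise:4d}  relevant chunks surviving in top-{K}: {hit}/3")
```

At zero noise all three relevant chunks make the top-5; as distractors pile up, some land closer to the query than the relevant chunks and evict them, which is the same curve the noise dial traces.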
How the two tools compare inside an agent
| Property | Grep tool | Vector retrieval tool |
|---|---|---|
| Index built upfront | none (or a trie / inverted index) | embedding pass + ANN index |
| Query cost | O(N) linear scan, fast for <100MB | embed + nearest-neighbour lookup |
| Strength | literal tokens, code, identifiers, dates, names | paraphrase, synonyms, fuzzy topical match |
| Failure mode | misses synonyms / rephrasing | top-k crowded out by distractors under noise |
| Robust to irrelevant context | yes — match is local | no — distance shifts as embeddings move |
| Agentic fit | cheap to call iteratively, easy to chain with head / tail | one call returns a fixed top-k bundle |
The grid above is not the paper’s tagline. The tagline is: once you put either tool inside an agent loop, the harness eats the difference. A harness that lets the model call the tool again with a refined query, look at a snippet, and decide whether to keep going outperforms a harness that calls the tool once and dumps top-k into context — regardless of which tool sits behind the call.
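To make that contrast concrete, here is a minimal sketch of the two harness shapes. The interfaces (`tool.search`, `model.answer`, `model.decide`, the `Decision` type) are placeholders invented for illustration, not the paper's or any platform's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Decision:
    """Placeholder for the model's control decision after seeing snippets."""
    done: bool
    refined_query: str = ""

def one_shot_harness(model, tool, question: str, k: int = 5) -> str:
    # Call retrieval once and dump a fixed top-k bundle into context.
    chunks: List[str] = tool.search(question, k=k)
    return model.answer(question, context=chunks)

def iterative_harness(model, tool, question: str, max_calls: int = 4) -> str:
    # Let the model refine the query, look at snippets, and stop early:
    # the loop shape the paper credits more than the retrieval algorithm.
    context: List[str] = []
    query = question
    for _ in range(max_calls):
        context.extend(tool.search(query, k=3))
        decision: Decision = model.decide(question, context)
        if decision.done:
            break
        query = decision.refined_query
    return model.answer(question, context=context)
```

Either function works with a grep-backed or a vector-backed `tool`; the point is that the loop around the call, not the call itself, is where the accuracy moves.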
What this changes for an agent designer
The practical reading is two-sided. Don’t reach for an embedding index and vector store the moment a project needs "search" — especially if the corpus is under a few hundred megabytes, the queries are literal-token-heavy (code, logs, identifiers, dates), and the agent will call the tool iteratively rather than once. Do still reach for vector retrieval when the corpus is large enough that a linear scan is slow, the queries are genuinely topical or paraphrased, or the agent has a single shot to retrieve. And independently of either choice, spend the design budget on the harness — on how the agent decides to call the tool, when it backs off, and how it threads results back into context. The paper is telling agent designers where the leverage actually lives.
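On the grep side, the tool that fits this iterative pattern can be almost nothing. A minimal sketch, assuming a POSIX `grep` on PATH; the function name, defaults, and truncation policy are illustrative choices, not from the paper:

```python
import subprocess

def grep_tool(pattern: str, root: str = ".", context_lines: int = 2,
              max_lines: int = 100) -> str:
    """Literal (fixed-string) recursive search with a little surrounding
    context, truncated so repeated calls stay cheap in the agent's context."""
    result = subprocess.run(
        # -r recursive, -n line numbers, -F literal match, -C context lines
        ["grep", "-rnF", f"-C{context_lines}", pattern, root],
        capture_output=True, text=True,
    )
    return "\n".join(result.stdout.splitlines()[:max_lines])
```

Because each call is cheap and the output is plain text, the agent can refine the pattern and call again, or trim the result further, the head / tail chaining the table above points at.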
Goes deeper in: AI Agents → Retrieval & RAG → RAG Failure Modes and AI Agents → Context Engineering → Tool-Loading Strategy
Related explainers
- AsyncFC — symbolic futures in the decode stream — another way the harness, not the model, decides how much wall-clock a tool call costs
- MCP SEP-2663 — async task handles for long-running tool calls — the protocol layer underneath whatever retrieval tool an agent picks