Is Grep All You Need? — Grep vs vector retrieval for agentic search
The news. On May 14, 2026, the "Is Grep All You Need?" paper (arXiv:2605.15184) ran two empirical experiments on agentic search. The first wired both a literal grep tool and a vector-retrieval tool into the same agents across multiple platforms and ran 116 LongMemEval questions through each; across the conditions tested, grep generally yielded higher accuracy than vector retrieval. The second progressively injected irrelevant context to probe robustness. The authors’ headline framing: overall performance is dominated by the agent harness and tool-calling style, not by the retrieval algorithm alone. Read the paper →
Picture the Ctrl-F vs librarian split again. When the user asks "What did the report say about the Paris office in Q3?" the Ctrl-F path is brutally literal — type the string Paris, get every line that contains it, scan the surrounding sentences. The librarian path is semantic — describe the question, the librarian walks the shelves and hands you the three books that "feel close." When the shelf is clean and the question is fuzzy ("anything about European expansion"), the librarian wins. When the shelf has been padded with unrelated chapters and the question hinges on a literal token, Ctrl-F just works.
The animation above runs both tools against the same 32-chunk document store. At low noise, vector retrieval pulls the right 3 chunks — the embeddings line up. As irrelevant context grows (the noise dial on the left fills up), distractor chunks start clustering near the query in vector space, and the top-k starts mixing in chunks marked pink — the wrong ones. Grep keeps emitting the same 3 literal matches regardless of how much unrelated text was added, because literal substring matching is immune to the kind of distributional drift that nudges nearest neighbours around.
The paper’s more interesting claim sits underneath the headline. When the authors swap the agent harness — different platforms, different tool-call shapes, different stop conditions — accuracy moves more than when they swap retrieval algorithms. A well-designed harness running vector retrieval can beat a poorly designed harness running grep. The retrieval algorithm is one knob among many, and it’s usually not the dominant one. That’s the line worth highlighting: this is not "embeddings are dead," it’s "your harness design and tool-loading strategy deserve more attention than your vector DB does."
Why irrelevant context hurts vector retrieval more
Vector retrieval works by distance in embedding space. Add irrelevant text, and three things happen at once. The query embedding drifts — the model is embedding a longer prompt that now mentions topics the user didn’t ask about. The document distribution shifts — distractor chunks crowd the latent space near the relevant ones. And the top-k cutoff stays fixed at, say, k = 5; if two distractors now sit closer than two relevant chunks, the relevant ones get evicted. The paper’s noise sweep makes this concrete, and the animation mirrors it with illustrative numbers: as noise rises, vector accuracy drops from roughly 71% to 35% while grep stays flat near 74%. Grep simply doesn’t care how much surrounding text exists; it matches what was typed and returns exactly those spans. The fragility isn’t in embeddings as a technology — it’s in the failure mode where the algorithm assumes a clean shelf and the harness keeps shoving random books onto it.
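The eviction mechanism is easy to see in a toy simulation. The sketch below is a minimal illustration, not the paper's setup: random unit vectors stand in for real embeddings, and the spreads and noise counts are arbitrary choices, so the printed numbers are synthetic. Grep isn't simulated because a literal match set is unchanged by added text by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 64, 5  # toy embedding size and top-k cutoff, both arbitrary

def embed_near(anchor, spread):
    """A synthetic 'embedding' at a controlled distance from an anchor."""
    v = anchor + spread * rng.normal(size=DIM)
    return v / np.linalg.norm(v)

query = rng.normal(size=DIM)
query /= np.linalg.norm(query)

# Three relevant chunks sit close to the query (indices 0-2).
relevant = [embed_near(query, 0.3) for _ in range(3)]

for n_noise in (0, 20, 200, 2000):
    # Distractors are drawn near the query too: topical-looking but wrong,
    # which is exactly what crowds the fixed top-k.
    docs = relevant + [embed_near(query, 0.5) for _ in range(n_noise)]
    sims = np.array([d @ query for d in docs])  # cosine, since all unit vectors
    topk = set(np.argsort(-sims)[:K])
    hit = len(topk & {0, 1, 2})
    print(f"noise={n_noise:4d}  relevant chunks surviving in top-{K}: {hit}/3")
```

At zero noise all three relevant chunks make the top-5; as distractors pile up, some land closer to the query than the relevant chunks and evict them, which is the same curve the noise dial traces.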
How the two tools compare inside an agent
| Property | Grep tool | Vector retrieval tool |
|---|---|---|
| Index built upfront | none (or a trie / inverted index) | embedding pass + ANN index |
| Query cost | O(N) linear scan, fast for <100MB | embed + nearest-neighbour lookup |
| Strength | literal tokens, code, identifiers, dates, names | paraphrase, synonyms, fuzzy topical match |
| Failure mode | misses synonyms / rephrasing | top-k crowded out by distractors under noise |
| Robust to irrelevant context | yes — match is local | no — distance shifts as embeddings move |
| Agentic fit | cheap to call iteratively, easy to chain with head / tail | one call returns a fixed top-k bundle |
The grid above is not the paper’s tagline. The tagline is: once you put either tool inside an agent loop, the harness eats the difference. A harness that lets the model call the tool again with a refined query, look at a snippet, and decide whether to keep going outperforms a harness that calls the tool once and dumps top-k into context — regardless of which tool sits behind the call.
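To make that contrast concrete, here is a minimal sketch of the two harness shapes. The interfaces (`tool.search`, `model.answer`, `model.decide`, the `Decision` type) are placeholders invented for illustration, not the paper's or any platform's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Decision:
    """Placeholder for the model's control decision after seeing snippets."""
    done: bool
    refined_query: str = ""

def one_shot_harness(model, tool, question: str, k: int = 5) -> str:
    # Call retrieval once and dump a fixed top-k bundle into context.
    chunks: List[str] = tool.search(question, k=k)
    return model.answer(question, context=chunks)

def iterative_harness(model, tool, question: str, max_calls: int = 4) -> str:
    # Let the model refine the query, look at snippets, and stop early:
    # the loop shape the paper credits more than the retrieval algorithm.
    context: List[str] = []
    query = question
    for _ in range(max_calls):
        context.extend(tool.search(query, k=3))
        decision: Decision = model.decide(question, context)
        if decision.done:
            break
        query = decision.refined_query
    return model.answer(question, context=context)
```

Either function works with a grep-backed or a vector-backed `tool`; the point is that the loop around the call, not the call itself, is where the accuracy moves.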
What this changes for an agent designer
The practical reading is two-sided. Don’t reach for an embedding index and vector store the moment a project needs "search" — especially if the corpus is under a few hundred megabytes, the queries are literal-token-heavy (code, logs, identifiers, dates), and the agent will call the tool iteratively rather than once. Do still reach for vector retrieval when the corpus is large enough that a linear scan is slow, the queries are genuinely topical or paraphrased, or the agent has a single shot to retrieve. And independently of either choice, spend the design budget on the harness — on how the agent decides to call the tool, when it backs off, and how it threads results back into context. The paper is telling agent designers where the leverage actually lives.
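On the grep side, the tool that fits this iterative pattern can be almost nothing. A minimal sketch, assuming a POSIX `grep` on PATH; the function name, defaults, and truncation policy are illustrative choices, not from the paper:

```python
import subprocess

def grep_tool(pattern: str, root: str = ".", context_lines: int = 2,
              max_lines: int = 100) -> str:
    """Literal (fixed-string) recursive search with a little surrounding
    context, truncated so repeated calls stay cheap in the agent's context."""
    result = subprocess.run(
        # -r recursive, -n line numbers, -F literal match, -C context lines
        ["grep", "-rnF", f"-C{context_lines}", pattern, root],
        capture_output=True, text=True,
    )
    return "\n".join(result.stdout.splitlines()[:max_lines])
```

Because each call is cheap and the output is plain text, the agent can refine the pattern and call again, or trim the result further, the head / tail chaining the table above points at.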
Goes deeper in: AI Agents → Retrieval & RAG → RAG Failure Modes and AI Agents → Context Engineering → Tool-Loading Strategy
Related explainers
- AsyncFC — symbolic futures in the decode stream — another way the harness, not the model, decides how much wall-clock a tool call costs
- MCP SEP-2663 — async task handles for long-running tool calls — the protocol layer underneath whatever retrieval tool an agent picks