Retrieval & RAG for AI Agents
Why Retrieval?
What is RAG?
Retrieval-Augmented Generation pulls relevant chunks of external knowledge into the model's context window before it answers, so the reply is grounded in those chunks rather than the model's frozen weights. The pattern was named in Lewis et al. (2020) and is now the default way to attach a model to any corpus it was not trained on. The shape is two boxes — a retriever that fetches passages, and the same generator you already use — wired so the retriever's output lands in the generator's prompt.
What it solves
A pretrained model knows three things badly. Recent facts: anything after the training cutoff is invisible — yesterday's incident, last quarter's pricing change. Private documents: your wiki, your CRM, your contracts were never in the public crawl. Citation requirements: a model that pulled a claim from training has no source to point at, so an answer that needs a footnote can't have one. RAG fixes all three by moving the answer's source from the weights into the prompt, where it's both fresh and quotable.
Where it sits in the agent stack
Module 1 framed an agent as a loop that calls tools to gather context. RAG is the most common such tool — search_docs(query) or query_kb(query) — and the chunks it returns become part of the next prompt. Module 5 picks up where the chunks land: context engineering is everything you do with the retrieved text once it arrives — ordering, deduplication, truncation, system-prompt placement.
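A minimal sketch of that wiring, with hypothetical stand-ins (`retrieve` for the vector search, `llm` for the model call; neither is a real library API):

```python
# RAG as one tool in the agent loop. A minimal sketch: `retrieve` and
# `llm` are hypothetical stand-ins for your vector search and model
# call, not a real library API.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Stand-in: embed the query, search the index, return top-k chunk texts."""
    raise NotImplementedError  # wired to your vector store in practice

def llm(messages: list[dict]) -> str:
    """Stand-in: one chat-completion call."""
    raise NotImplementedError

def agent_turn(question: str) -> str:
    chunks = retrieve(question)  # the search_docs(query) tool call
    messages = [
        {"role": "system", "content": "Answer only from the provided chunks."},
        {"role": "user", "content": "\n\n".join(chunks) + "\n\n" + question},
    ]
    return llm(messages)  # the retrieved text lands in the next prompt
```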
The catch
Retrieval moves the failure surface. With no retrieval, a wrong answer is the model's fault; with retrieval, a wrong answer is usually the retrieval pipeline's fault — the wrong chunk landed in the prompt, and the model dutifully grounded a confident reply in it. The model becomes more knowledgeable and a new class of failures appears underneath it. The rest of this module walks through where that pipeline goes wrong: text becoming vectors (Step 2), the end-to-end pipeline (Step 3), chunking (Step 4), approximate nearest neighbor search (Step 5), and the failure modes that survive a tuned pipeline (Step 6).
Try it: Pick a query in the right pane and compare the two panels. Without retrieval, the model fabricates a number, refuses, or gives generic advice that misses the actual policy. With one retrieved chunk it cites the correct answer. Same query, same model, different inputs.
Embeddings as Coordinates
What does it mean for two chunks to be "similar"?
Every chunk gets fed through an embedding model and comes out as a vector — a fixed-length list of numbers, typically a few hundred to a few thousand dimensions wide depending on the model (all-MiniLM-L6-v2 is 384, voyage-3 is 1,024, text-embedding-3-large is 3,072). Chunks whose meaning overlaps land at nearby coordinates in that vector space; chunks about unrelated things land far apart. Cosine similarity measures the angle between two vectors as a single number in [-1, 1] — 1 means same direction (synonyms), 0 means orthogonal (unrelated), negative means opposite.
What this buys retrieval
When a user query arrives at runtime, you embed it with the same model that embedded the corpus, compute cosine similarity against every chunk vector, and return the top-k highest scores. The whole retrieval pipeline rides on one geometric fact: meaning is preserved as direction. No keyword overlap required — "rotate the SSO certificate" can match oauth-setup.md even with zero shared words.
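This is runnable end to end with an open-source embedder. A minimal sketch; the corpus strings are illustrative, and which chunk actually ranks first depends on the embedder:

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim vectors

corpus = [
    "oauth-setup.md: regenerate the signing key, then upload it to the IdP.",
    "billing.md: refunds are processed within 5 business days.",
    "deploy.md: run make deploy after CI passes.",
]
chunk_vecs = model.encode(corpus, normalize_embeddings=True)

# Embed the query with the SAME model that embedded the corpus.
query_vec = model.encode(["rotate the SSO certificate"],
                         normalize_embeddings=True)[0]

# With unit-normalized vectors, cosine similarity is just a dot product.
scores = chunk_vecs @ query_vec
for rank, i in enumerate(np.argsort(-scores)[:2], start=1):
    print(f"rank {rank}: {scores[i]:.3f}  {corpus[i]}")
```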
You already learned the math — this step is the debugging intuition
Embeddings as a concept are taught from first principles in LLM Internals: Embeddings — that's where the lookup table, the dimensionality trade, and cosine similarity live. The job of this step is the retrieval-debugging instinct: when a chunk doesn't show up for a query that should match, the most common cause is that one of the two got mapped to a corner of the space the other doesn't reach. The model isn't broken; the geometry is.
Two corpora, two cluster shapes
Different domains produce different cluster topology. A customer-support corpus splits along issue categories — billing tickets cluster tightly, shipping tickets cluster tightly, and the two sit on opposite sides of the space. An engineering wiki splits along subsystems instead, with cross-cutting concerns like "monitoring" landing in a thinner band between deployment and database docs because it touches both. The choice of embedder and the corpus together set the geometry; changing either reshapes which queries can reach which chunks.
Try it: Drag the violet QUERY point in the right pane. Its top-3 nearest chunks update live with cosine similarity scores, and the connecting lines fade with similarity. Toggle corpora to see how the cluster shape changes when the vocabulary changes — the same dragging gesture lands on a different geometry, with a different mix of nearest documents.
Step 3 walks the full retrieve-then-generate pipeline end to end — query in, answer out, chunks in the middle.
Retrieve-Then-Generate
The five steps, and which one is the model
Retrieve-then-generate is a five-step pipeline: the user query arrives, gets embedded, searches the index for top-k chunks, the chunks are assembled into a prompt, and the model generates the answer. Four of those steps are retrieval infrastructure; one is the model. The obvious instinct is to treat the model call as the interesting bit; a few months of shipping RAG inverts it. Most of your quality wins live in steps 1-4, and fixating on step 5 is how you spend a quarter tuning prompts when the bug was in the search step the whole time.
The pipeline, named
| Step | What happens | Failure-mode preview |
|---|---|---|
| 1. Query | user's raw text arrives | wrong phrasing → wrong neighborhood |
| 2. Embed | same embedder as the corpus turns text into a vector | different embedder than the corpus → meaningless dot products |
| 3. Search | top-k chunks by cosine similarity | k too small → right chunk at rank k+1; k too big → noise drowns signal |
| 4. Assemble | chunks glued into the prompt with a system message | no "answer only from chunks" instruction → the model falls back on training data |
| 5. Generate | model produces the answer | chunks were wrong → confidently grounded wrong answer |
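In code, the table collapses to a few lines of glue. A minimal sketch, assuming hypothetical `embedder`, `index`, and `llm` objects (`assemble_prompt` is sketched in the next subsection):

```python
# Hypothetical glue: `embedder`, `index`, and `llm` stand in for your
# embedding model, vector store, and chat model.

def rag_answer(query: str, k: int = 5) -> str:  # 1. Query arrives
    q_vec = embedder.encode(query)              # 2. Embed (same model as the corpus)
    chunks = index.search(q_vec, k=k)           # 3. Search: top-k by cosine
    prompt = assemble_prompt(query, chunks)     # 4. Assemble (sketch below)
    return llm(prompt)                          # 5. Generate: the only model call
```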
"Stuff into context" is doing more work than it sounds
Step 4 looks mechanical — concatenate strings, call the model — but it carries the whole grounding contract. The retrieved chunks become part of the prompt, and the model has no built-in way to distinguish facts from your wiki from facts from training data. You have to tell it: answer only from the chunks below; cite the filename for every claim; refuse if the chunks don't cover the question. Without that instruction, the model will happily blend training-data plausibility with one of your chunks and hand back a confident hybrid that points at a real source for the wrong half of the answer.
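One way to write that contract down, as a sketch (the wording is illustrative, not canonical; the load-bearing parts are "answer only from", "cite", and "refuse"):

```python
# Illustrative grounding prompt for step 4. Each chunk is assumed to be
# a dict with "filename" and "text" keys; adapt to your store's schema.

def assemble_prompt(query: str, chunks: list[dict]) -> str:
    sources = "\n\n".join(f"[{c['filename']}]\n{c['text']}" for c in chunks)
    return (
        "Answer using ONLY the sources below. Cite the filename in brackets "
        "for every claim. If the sources do not cover the question, say so "
        "instead of guessing.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )
```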
The cost line
Every retrieved chunk is tokens you pay for on input and tokens that compete for the model's attention budget. More chunks does not mean better answers — past some k, signal-to-noise drops, latency rises, and the right chunk at rank 3 gets ignored next to two near-duplicates at rank 1 and 2. Module 5 (Context Engineering) is the whole study of where on that curve to sit.
Try it: Click any stage card in the right pane to inspect its input and output. The actual model call is the last card — the other four decide whether it has a chance. Hit ▶ run pipeline to watch all five light up in sequence.
Steps 4-6 zoom into the three biggest variables in this pipeline: chunking (how the corpus was split before any of this ran), the retrieval algorithm (how step 3 finds the top-k without comparing against every vector), and the failure modes that survive a tuned pipeline.
Chunking Strategies
Step 3 walked the runtime path — query in, answer out. This step moves back to indexing time: what happens to the corpus before any of that runtime pipeline runs. The source documents have to be cut into pieces; each piece becomes one row in the index, one searchable unit, one thing the retriever can return or skip. The slicing rule is the design decision. Two pipelines pointed at the same docs with the same embedder, but chunked differently, retrieve different chunks, write different prompts, and produce different answers. Chunking is not a preprocessing detail; it's the silent first hyperparameter of the whole pipeline.
Three ways to slice
| Strategy | How it splits | Best for |
|---|---|---|
| Fixed-size | every N tokens or characters, regardless of structure | uniform corpora, fast indexing, baseline |
| Sentence / paragraph | respects natural boundaries — one sentence or one paragraph per chunk | prose, FAQs, policy docs |
| Semantic | splits where the embedding shifts topic; more expensive, runs the embedder during indexing | long-form articles, mixed-topic pages |
LangChain's text splitters ship implementations of all three plus structure-aware variants for code and Markdown.
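For example (the import path matches recent releases of the standalone langchain-text-splitters package; older versions expose the same class from langchain.text_splitter, and the filename here is hypothetical):

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # max characters per chunk
    chunk_overlap=75,   # ~15% sliding-window overlap (see Overlap below)
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs, then lines, then sentences
)

with open("warranty-policy.md") as f:
    chunks = splitter.split_text(f.read())  # one string per chunk, ready to embed
```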
The tradeoff that picks the size
Small chunks are precise — the answer-bearing sentence is the chunk, so similarity to a well-phrased query is high — but they lose context: "Items damaged by misuse are not covered" means little without the warranty section around it. Large chunks invert both: context-rich (the answer arrives wrapped in its paragraph) and noisy (the answer is one sentence inside text that's mostly irrelevant, so cosine similarity gets diluted and a tighter-topic doc can rank above the right one).
The sweet spot depends on the document and the query, which is why this is a tunable knob, not a default. A FAQ with one question per paragraph wants paragraph-sized chunks; a policy doc with many sub-rules wants smaller; a mixed knowledge base usually needs to try a few and measure.
Overlap, the operational fix
Sentence-aware chunking still slices answers in half when an answer spans a boundary. The standard mitigation is chunk overlap — a sliding window that repeats the last ~10-20% of one chunk at the start of the next. The cost is a slightly larger index and some duplicate text in results; the win is that an answer cut at a boundary survives in at least one of the two adjacent chunks.
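The mechanism is small enough to write out. A from-scratch sketch, counting characters for simplicity (real pipelines usually count tokens):

```python
# Fixed-size chunking with a sliding-window overlap: each window starts
# `size - overlap` characters after the previous one, so text near a
# boundary appears in two adjacent chunks.

def chunk_with_overlap(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    chunks = []
    step = size - overlap  # advance less than `size` so windows overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window reached the end of the text
    return chunks
```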
Try it: Ask the same query at each chunk size. At tiny, the answer sentence is sliced across boundaries — the top-2 are fragments that don't carry the full instruction. At small, the answer-bearing sentence is its own chunk at rank 1. At large, the whole doc is the only chunk — top-1 by default, but the similarity score drops because most of the chunk is unrelated text. Same doc, same query; chunk size decides whether the answer reaches the model.
Step 5 covers the algorithm that searches the chunked corpus — and the recall-vs-speed knob it offers in exchange for not comparing the query against every chunk.
ANN — Recall vs Speed
Search has two ends of a spectrum. Brute force compares the query against every vector in the index — O(N), exact, and slow at scale. Approximate nearest neighbor (ANN) uses a clever index that finds the closest vectors in sub-linear time (roughly logarithmic for graph indexes, though the actual curve depends on the algorithm and tuning) — fast, but it might miss a few of the truly-closest docs. The interesting question isn't which is better; it's where the crossover sits for your corpus.
Why brute force breaks down
The numbers are unforgiving. At 1k docs, comparing the query against everything takes microseconds — brute force is fine. At 100k, the same scan takes hundreds of milliseconds, which is already too slow for an interactive query. At 10M, you're looking at tens of seconds per query — unusable. ANN flattens that curve: query time stays in single-digit milliseconds across all three regimes.
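The baseline is a few lines of numpy; the O(N) cost is the full matrix-vector product (random unit vectors here, purely to show the mechanics):

```python
# Brute-force exact search: compare the query against every vector.
import numpy as np

N, dim = 100_000, 384
corpus = np.random.randn(N, dim).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize

query = corpus[0] + 0.1 * np.random.randn(dim).astype(np.float32)
query /= np.linalg.norm(query)

scores = corpus @ query            # N dot products: the O(N) part
top_10 = np.argsort(-scores)[:10]  # exact top-k, no approximation
```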
HNSW in one paragraph
The dominant ANN algorithm in production is HNSW — Hierarchical Navigable Small World graphs. At index time, build a graph where each vector knows its handful of closest neighbors. At query time, start anywhere and greedily hop to whichever neighbor is closest to the query. Because the graph has a small-world structure, you converge on the nearest doc in roughly log(N) hops on average — fast enough that the per-query cost barely moves as the corpus grows from a hundred thousand to ten million docs. The "approximate" part is that greedy hopping can land in a local minimum and miss the global nearest — typical settings achieve around 95% recall@10, but the actual number depends on ef_search and how the graph was built. The other common index family is IVF, which clusters vectors and only scans the closest few clusters.
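A hedged sketch with the hnswlib library, continuing the `corpus` and `query` arrays from the brute-force block above:

```python
# pip install hnswlib
# M and ef_construction are build-time knobs (Track B covers them).
import hnswlib

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=N, ef_construction=200, M=16)
index.add_items(corpus)  # builds the layered small-world neighbor graph

index.set_ef(50)  # ef_search: how widely each query traverses the graph
labels, distances = index.knn_query(query, k=10)  # approximate top-10
```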
The tunable knob
ANN gives you a slider brute force doesn't: more graph traversals (ef_search) means higher recall but slower queries. The sweet spot depends on whether your downstream model can tolerate the occasional missed doc — generative models with strong reasoning often shrug off a 5% recall miss; strict-citation pipelines won't.
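You can measure the knob directly by treating the brute-force top-10 as ground truth (continuing the two sketches above; recall numbers on random vectors won't match a real corpus):

```python
# recall@10 = overlap between ANN's top-10 and the exact top-10.
exact = set(top_10.tolist())
for ef in (10, 50, 200):
    index.set_ef(ef)  # higher ef_search: more traversal, slower queries
    labels, _ = index.knn_query(query, k=10)
    recall = len(exact & set(labels[0].tolist())) / 10
    print(f"ef_search={ef:>3}  recall@10={recall:.2f}")
```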
What's in production
Real systems converge on a small set of choices: pgvector (HNSW inside Postgres), Pinecone, Weaviate, Qdrant, Milvus, FAISS in-process. All implement HNSW or IVF or both. The decision is operational — managed vs self-hosted, scale, query patterns — not algorithmic.
HNSW internals (ef_construction, M, the layered graph), hybrid search (BM25 + dense), reranking, and eval-driven index tuning live in Track B: Retrieval at Production Depth (planned). For Foundations, the slider intuition is enough.
Try it: Toggle the corpus size in the right pane. At 1k, brute force wins — graph overhead makes HNSW slightly slower at this scale. At 100k, HNSW pulls ahead by ~100×. At 10M, HNSW is the only viable option. Then drag the recall slider to see the tradeoff — moving from 80% to 99% costs a fraction of a millisecond, but at scale that's still orders of magnitude faster than the linear scan.
RAG Failure Modes
Retrieval moves the failure surface. Without it, the model fails — it refuses, fabricates, or falls back on generic priors. With it, the model usually answers fine, but the pipeline behind it has its own failure modes. Three of them account for most "RAG isn't working" reports.
The three failures, named
| Failure | Symptom | Where it came from |
|---|---|---|
| Irrelevant retrieval | confident answer about the wrong topic | top-k didn’t contain the answer chunk |
| Lost-in-the-middle | “I couldn’t find that” when the answer was right there | answer chunk landed in the middle of a long context |
| Stuffed-context noise | hedged, partial, weaker-than-it-should-be reply | too many chunks competing for attention |
Irrelevant retrieval — the most common one
Top-k didn't contain the answer chunk, and the model dutifully answered using whatever was in the prompt. The reply reads fluent and cited; it's just about the wrong thing. Usual causes: chunking that buried the answer-bearing sentence inside a larger noisy chunk, an embedding model that doesn't know the domain (generic embedders are weak on medical, legal, internal-jargon corpora), or a query phrased differently from the corpus — "warranty" vs "limited liability coverage". Fixes: revisit chunking (Step 4), swap to a domain-tuned embedder, or add a reranker that re-scores top-50 before truncating to top-3 (Track B).
Lost-in-the-middle
Liu et al. (2023) placed the answer at different positions within a long context and measured accuracy. It was highest at the very start or end and dipped sharply in the middle — even on models advertising 100k+ token windows. With top-10 retrieval, slots 4–7 are the danger zone: the retriever does its job, returns the right chunk, and the model still misses it. Fixes: reorder chunks by relevance (most-relevant first or last), compress the middle into 1-line summaries, or just drop k.
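The reorder fix is a few lines: re-sort the retrieved list so the best chunks sit at the edges of the context, where Liu et al. measured the highest accuracy. A sketch, assuming `chunks` arrives already sorted by descending relevance (LangChain ships a similar transform as LongContextReorder):

```python
# Alternate ranks between the front and the back of the list, so the
# least relevant chunks end up in the lost-in-the-middle danger zone.

def edge_order(chunks: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks):  # chunks sorted best-first
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# ranks 1..6 land as [1, 3, 5, 6, 4, 2]: the two best chunks at the edges
print(edge_order(["r1", "r2", "r3", "r4", "r5", "r6"]))
```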
Stuffed-context noise and the practical ceiling
The instinct after a missed answer is to raise k. It usually makes things worse — each extra chunk competes for the model's attention budget, so doubling k from 5 to 10 dilutes the chunk that mattered. Aider's Paul Gauthier has reported that practical accuracy ceilings often appear well below the advertised window — a roofline-style guideline, not a hard limit. The rule: use the smallest k that holds the answer.
Try it: Cycle through the three failure modes in the right pane. Same query, three different ways the pipeline can break. Click + reveal on the fix pane to see the one-line mitigation — chunking and reranking for irrelevant retrieval, reorder/compress for lost-in-the-middle, smaller k for stuffed context.
Module wrap-up
Six steps, one thesis. Retrieval is the workflow primitive most agents reach for first, but its quality is set by the four mechanical steps before generation: chunk, embed, search, assemble. The model is the smallest variable in that chain — most "RAG isn't working" reports trace back upstream to a chunking decision, not to the model's reasoning. Module 5 (Context Engineering) takes it from here: once you can retrieve well, the discipline becomes filling the context window with just the right info for the next step. Anthropic's effective context engineering post is the recommended primer.