Agents & RAG · Module 06 · 10 min read

Retrieval-Augmented Generation

Models don’t know what they weren’t trained on, and you can’t retrain them every time your data changes. RAG is the plumbing that lets a model answer questions about documents it has never seen — by fetching the right ones at query time.

The five-bullet version

  • An LLM only knows what was in its training data. New, private, or fast-changing data is invisible.
  • RAG flips the problem: at query time, fetch relevant text and put it in the prompt.
  • To fetch, you need a search index over your data. Vector search using embeddings is the modern default.
  • Documents are chunked, embedded, and stored in a vector database ahead of time.
  • At query time: embed the question, find the closest chunks, hand them and the question to the model. Answer is grounded.

§ 00 · THE PROBLEM RAG SOLVES
Why the model alone isn’t enough

An LLM is a frozen artifact. Its knowledge is whatever the training corpus contained on the day pretraining stopped. Ask it about your company’s expense policy, a deal you closed last week, or a client’s case file — it has no idea. Ask it about Q4 earnings or a paper published yesterday — same problem. Even a model trained on the public web has a cutoff date and zero access to anything private.

There are three ways to fix this. You can retrain the model on the new data — expensive, slow, and you’ll need to redo it every time anything changes. You can fine-tune with LoRA — cheaper, but fine-tunes teach style and behavior more than they teach facts, and you still have to repeat the process when the data updates. Or you can fetch the relevant text on demand (from an external store, at query time) and shove it into the prompt, so the answer is grounded in current or private information. That’s retrieval-augmented generation: RAG.

The bet is simple: a model with a 100-token recipe in front of it will answer cooking questions better than a model that read 100,000 recipes during training. As long as you can find the right text fast, you don’t need the model to memorize anything.

§ 01 · THE PIPELINE AT A GLANCE
Two phases, five steps

RAG runs in two phases. The index phase happens once per document (or per-update). The query phase happens on every user request.

Lab · the RAG pipeline
Example input · employee-handbook.pdf (240 pages)

“Section 7.3 — Remote work policy. Employees are eligible to work remotely up to three days per week, subject to manager approval and successful completion of probationary period. Section 7.4 — Time-off accrual. Full-time employees accrue 1.5 days of paid time off per calendar month, prorated for partial-month employment…”

Long, mixed-topic text. Way too much for a model to read all at once. Useful information is hidden in specific paragraphs.

Most of the engineering effort is in the boring parts: parsing the document, picking the right chunk boundaries, choosing an embedding model, deciding how many chunks to retrieve, and writing a prompt that actually makes the model use them. The clever-sounding parts (vector search, dense embeddings) are mostly off-the-shelf now.

§ 02 · CHUNK · EMBED · STORE
Building the index

Chunking. Long documents have to be split into pieces that fit the embedding model’s input window and represent one coherent idea each. Three competing pressures: chunks small enough that each embedding captures a single idea, large enough that the model has the surrounding context to interpret them, and split at natural boundaries (sections, paragraphs) rather than mid-thought.

A safe default: 512–1024 tokens per chunk, with 50–100 tokens of overlap between adjacent chunks. Overlap means a topic that lives near a chunk boundary doesn’t get cut in half — it appears whole in one of the two chunks that span it.
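A rough sketch of that default. Whitespace "tokens" stand in for real tokenizer tokens here; a production pipeline would count tokens with the embedding model’s own tokenizer:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks.

    Whitespace splitting approximates token counts; swap in the
    embedding model's tokenizer for production use.
    """
    assert 0 <= overlap < chunk_size
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap          # how far the window advances each time
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(tokens):
            break                        # last window already covers the tail
    return chunks
```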

Embedding. Each chunk is run through an embedding model (a small transformer such as OpenAI text-embedding-3, Cohere Embed v3, BGE, or Voyage) and out comes a single fixed-size vector — typically 384, 768, 1024, or 1536 dimensions. The model is trained on millions of (similar text, dissimilar text) pairs so that semantically related content ends up nearby in the vector space.

Two practical notes: match the embedding model to the domain (a code embedder for code, multilingual for non-English), and use the same model for queries and chunks — vectors from different models live in different spaces and won’t compare meaningfully.
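A minimal sketch of the embedding step, assuming the sentence-transformers library and the BGE model mentioned in the check question below; any embedding model works, as long as the same one handles chunks and queries:

```python
from sentence_transformers import SentenceTransformer

# One of many options; bge-small-en-v1.5 produces 384-dimensional vectors.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

chunks = [
    "Section 7.3 — Remote work policy. Employees may work remotely up to three days per week.",
    "Section 7.4 — Time-off accrual. Full-time employees accrue 1.5 days of PTO per month.",
]
# Normalizing makes cosine similarity a plain dot product later on.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)              # shape (2, 384)
query_vec = model.encode("How many remote days are allowed?",
                         normalize_embeddings=True)                      # shape (384,)
```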

Store. Vectors go into a vector database — Pinecone, Qdrant, Weaviate, pgvector, or just a flat numpy array if your corpus is small. The database supports approximate nearest-neighbor search (HNSW, IVF) so you can find the closest k chunks to any query vector in milliseconds, even with millions of entries.
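For a small corpus the flat-numpy-array option really is this simple: with L2-normalized vectors, cosine similarity is a dot product, so top-k retrieval is one matrix multiply and a sort. A sketch (a vector database’s ANN index replaces this exact scan at scale):

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k chunks most similar to the query.

    Assumes all vectors are L2-normalized, so the dot product equals
    cosine similarity. An ANN index (HNSW, IVF) replaces this exact
    scan once the corpus grows to millions of chunks.
    """
    scores = chunk_vecs @ query_vec          # (n_chunks,) similarity scores
    return np.argsort(-scores)[:k].tolist()
```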

§ 03 · RETRIEVE · RERANK · GENERATE
Answering a question

On every request:

  1. Embed the question with the same model used during indexing.
  2. Search the vector database for the top-k chunks (typically k = 3 to 10).
  3. Construct the prompt: a brief instruction, the retrieved chunks, then the user’s question. A common template: “Use only the context below to answer. If the answer isn’t there, say so.”
  4. Generate with the LLM. The answer should be grounded in the retrieved chunks rather than the model’s training data (a minimal end-to-end sketch follows Fig 1 below).
Fig 1 · The end-to-end query flow: user question → embed → vector search → top-k chunks → prompt assembly → LLM → grounded answer. Two phases — index (offline, once) and query (online, per request). Most failure modes live in retrieval. The pipeline is small enough that most engineering teams can implement a v1 in an afternoon.
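Tying the four steps together, here is a minimal query-phase sketch. It reuses the top_k helper from the search snippet above, takes the embedding function as a parameter, and uses the OpenAI chat API purely as an example; the model name is illustrative, not a recommendation:

```python
from openai import OpenAI  # any chat-capable LLM client works; OpenAI is only an example

def answer(question: str, chunks: list[str], chunk_vecs, embed, k: int = 5) -> str:
    """Query phase end to end: embed, retrieve, assemble prompt, generate.

    `embed` must be the SAME embedding model used at index time;
    `chunk_vecs` is the matrix built in the index phase; `top_k` is the
    brute-force search helper sketched earlier.
    """
    q_vec = embed(question)
    idx = top_k(q_vec, chunk_vecs, k=k)
    context = "\n\n".join(chunks[i] for i in idx)

    prompt = (
        "Use only the context below to answer. "
        "If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```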

Two follow-ups people often add to step 4: citations (have the model name which chunk it used, for trust) and follow-up retrieval (let the model do another round of search if it decides the first one wasn’t enough — covered in the Advanced RAG lesson).
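For the citations part, one lightweight convention is to number the retrieved chunks in the prompt and ask the model to cite those numbers. A sketch — the bracket format is just a convention, not a standard:

```python
def build_cited_prompt(question: str, retrieved: list[str]) -> str:
    """Number each retrieved chunk so the model can cite its sources as [1], [2], ..."""
    numbered = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved))
    return (
        "Answer using only the numbered context below, and cite the chunk "
        "numbers you relied on, e.g. [2].\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )
```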

§ 04 · WHERE NAIVE RAG BREAKS
What you’ll see when this doesn’t work

The simple recipe above will get you to a working demo in a day. It will start failing in production for predictable reasons; the most common ones are covered under Going Deeper below.

CHECK · You're building a RAG system over 200 legal contracts. Your embedding model is bge-small-en-v1.5. A user asks 'What's the indemnity cap in our standard SaaS contract?' The model answers, but cites the wrong chunk. What's the most common cause?

§ 05 · TAKING THIS FORWARD
What to build next

Three follow-ups in roughly priority order:

Everything in the Advanced RAG lesson — reranking, hybrid search, query rewriting, multi-step retrieval — builds on top of the basic pipeline you have here.

§ · GOING DEEPER
Where RAG actually breaks in production

Naive RAG fails in predictable ways. The most common: the retriever returns adjacent-but-wrong chunks because dense embedding similarity confuses topic with answer. The second most common: chunks are too big (so the embedding represents an average of several ideas) or too small (so they lack the surrounding context the LLM needs to interpret them). Lewis et al.’s original RAG paper (2020) demonstrated the pattern; the operational discipline came later.

Two upgrades make most production systems noticeably better. First, a second-stage reranker that scores (query, chunk) pairs directly on top of the dense recall stage (a cross-encoder, or a late-interaction model like ColBERT; Khattab & Zaharia 2020) — the second-stage signal is dramatically more reliable than vector similarity alone. Second, hybrid retrieval that fuses BM25 (term-based) with dense vectors via Reciprocal Rank Fusion — it covers acronyms, code identifiers, and rare nouns that embeddings smooth over. Both are off-the-shelf and almost always worth the latency.
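Reciprocal Rank Fusion itself is only a few lines: each ranked list contributes 1 / (k + rank) per document, and documents are sorted by their summed score. A sketch; k = 60 is a common default from the original RRF work:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs (e.g. one from BM25, one from dense search).

    Each list contributes 1 / (k + rank) per document; documents that rank
    well in either list float to the top of the fused ordering.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse a term-based ranking with a dense-vector ranking
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # BM25 ranking
    ["doc1", "doc9", "doc3"],   # dense-vector ranking
])
```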

§ · FURTHER READING
References & deeper sources

  1. Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks · NeurIPS
  2. Karpukhin et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering · EMNLP
  3. Khattab & Zaharia (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT · SIGIR
  4. Izacard & Grave (2020). Leveraging Passage Retrieval with Generative Models for Open Domain QA (FiD) · EACL
  5. Gao et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey · arXiv

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.