Advanced RAG
Naive RAG fails in five predictable ways. Each failure has a standard fix — reranking, hybrid search, query rewriting, multi-step retrieval. This is the tour of what production RAG actually looks like.
The five-bullet version
- Naive RAG retrieves the wrong chunk often enough that production systems need a second pass.
- Reranking: pull top-50 with cheap vector search, rerank to top-5 with a cross-encoder. Biggest single quality jump.
- Hybrid search: combine BM25 (lexical) with dense vectors. Each catches what the other misses.
- Query rewriting / HyDE: transform the user’s question into something the index will match better.
- Multi-step retrieval: let the model retrieve, read, then retrieve again with what it now knows.
§ 00 · WHERE NAIVE RAG RUNS OUT · Five failure modes you’ll meet
If you build the pipeline from the RAG lesson, hand it to real users, and watch the failures, you’ll see the same five patterns repeat:
- Right answer, wrong chunk. The chunk that contains the answer doesn’t make it into the top-k. The query and the answer are phrased differently enough that vector similarity alone misses.
- Exact terms don’t match. The user asks about “SOC 2 Type II” and the relevant chunk uses “Service Organization Controls 2 (Type II).” Semantic similarity is fine but not perfect; lexical search would have nailed it.
- Multi-hop questions. The user asks something that needs two or three retrievals chained — find document A, see it references topic B, retrieve about B, combine.
- Ambiguous queries. “What did we decide?” carries no nouns. The retrieval embedding is essentially noise.
- Long-tail topics. A topic mentioned in one chunk in a 100k-chunk corpus competes with a hundred near-misses. Top-5 retrieval drowns it.
Each fix below addresses one or more of these. The right combination depends on your data and queries — don’t add all of them prophylactically. Each one adds latency and complexity.
§ 01 · RERANK — THE 2-STAGE RETRIEVER · Retrieve cheap, rerank expensive
Embedding-based search is fast because it’s a single distance comparison per chunk. It’s also imperfect: a cosine similarity of 0.78 doesn’t reliably mean a better match than one scoring 0.72.
A cross-encoder reranker takes the query and a candidate chunk as joint input, in one forward pass, and outputs a single relevance score. It can’t be precomputed (the chunk and the query interact at every layer), so it’s slow. But, scored one pair at a time, it’s far more accurate than vector similarity.
The standard 2-stage pattern:
- First-stage retrieval: pull the top 50–100 chunks with cheap vector search (or hybrid, below).
- Second-stage rerank: run those 50 through a cross-encoder. Keep the top 5 by reranker score. Hand those to the LLM.
Common rerankers: Cohere Rerank, BAAI/bge-reranker, Voyage. Most are small enough to self-host. The quality jump from adding a reranker is usually the single largest in the entire RAG stack — 10–20 points on retrieval metrics like NDCG, even with the embedding model unchanged.
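The two-stage pattern is, structurally, just two sorts with different scoring functions. A minimal sketch — both scorers here are toy stand-ins; in production the first stage would be a vector index and the second a real cross-encoder such as bge-reranker or Cohere Rerank:

```python
def two_stage_retrieve(query, corpus, cheap_score, rerank_score,
                       first_k=50, final_k=5):
    """Stage 1: cheap score over the whole corpus, keep first_k.
    Stage 2: expensive score over the survivors only, keep final_k."""
    candidates = sorted(corpus, key=lambda c: cheap_score(query, c),
                        reverse=True)[:first_k]
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:final_k]

corpus = ["refund policy for annual plans",
          "office holiday schedule",
          "disputing a charge and getting a refund"]
# Toy scorers: word overlap stands in for vector search; an
# exact-phrase bonus stands in for the cross-encoder's joint scoring.
cheap = lambda q, c: len(set(q.split()) & set(c.split()))
rerank = lambda q, c: cheap(q, c) + (2 if q in c else 0)
top = two_stage_retrieve("refund policy", corpus, cheap, rerank,
                         first_k=2, final_k=1)
```

The shape is the point: the expensive scorer only ever sees `first_k` candidates, so its per-query cost is bounded no matter how large the corpus grows.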
§ 02 · HYBRID — SPARSE + DENSE · BM25 still earns its keep
Vector embeddings shine at paraphrases — when the query and the answer use different words for the same idea. They struggle when the query contains a rare term, a product name, a code identifier, or a number. The embedding fuzzes all of those into “there is a token here that looks code-y,” and dozens of near-miss chunks score almost identically.
BM25 — the lexical workhorse of the pre-deep-learning era — does the opposite: it counts how often query terms appear in a document, weighted inversely by how common those terms are in the corpus, so it scores high on exact-term overlap. Bad at paraphrases, great at “the user typed the exact phrase.” Old, simple, sturdy — and still competitive when exact-term matching matters. Modern hybrid retrieval runs both, scales the scores, and combines them — typically with a simple weighted sum or with Reciprocal Rank Fusion (RRF).
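The scoring function described above fits in a few lines. A toy version, simplified from the textbook formula (real implementations such as Lucene’s differ in details like IDF flooring):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Toy BM25: term frequency damped by k1, length-normalised by b,
    weighted by inverse document frequency across the corpus."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

# Documents are pre-tokenised term lists.
corpus = [["refund", "policy", "annual"], ["office", "holiday", "schedule"]]
```

Note what it cannot do: a query term that never appears in a document contributes exactly zero, no matter how close the meaning — which is precisely the gap dense retrieval fills.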
The combination is more than the sum of parts because BM25 catches the queries dense search misses, and vice versa. Concretely:
- A query like “refund policy” benefits from BM25 — exact words, common terminology.
- A query like “getting money back after dispute” benefits from dense — the doc says “refund” but the user didn’t.
- Both run, both scores get fused, the right chunk wins under either phrasing.
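RRF itself is small enough to show whole. A minimal pure-Python sketch — the rankings below are hypothetical, and `k=60` is the constant from the original RRF paper:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank) over
    every list that ranked d. Higher fused score first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings for the query "refund policy": BM25 nails the
# exact phrase; dense search prefers a paraphrase-heavy chunk.
bm25_top = ["refund-policy", "billing-faq", "office-holiday"]
dense_top = ["dispute-process", "refund-policy", "billing-faq"]
fused = rrf_fuse([bm25_top, dense_top])
# "refund-policy" wins: ranked high by both lists.
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.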
[An interactive demo here compared retrieval modes on the same query. With dense vector similarity only, retrieval picks chunks that are semantically near the query but misses exact-phrase matches: an “office holiday” chunk survives because it has semantic overlap with “offsite.” Under hybrid plus rerank it drops out — no lexical match, and the cross-encoder rejects it. Same query, very different answer depending on what you pulled.]
§ 03 · QUERY REWRITING & HYDE · Make the question match what’s indexed
Sometimes the user’s question is the problem. A short, ambiguous, or unusually phrased query produces a bad embedding. Better to rewrite the query before retrieval.
Three common rewriting patterns:
- Decomposition. “Compare our 2024 and 2025 cloud spend by region” → two retrievals (2024, 2025), each per-region. The model decomposes the question into parallel sub-queries, retrieves each, combines.
- Expansion. “ETA on Project Falcon?” → retrieval looks for “Falcon timeline”, “Falcon delivery”, “Falcon roadmap.” Cast a wider net by generating synonyms before searching.
- HyDE (hypothetical document embeddings). Ask the LLM to generate a fake answer to the question. Embed that fake answer (not the question). Use that vector for retrieval. The hypothetical answer is in the same “shape” as real answers, so it embeds closer to the right chunks than the question does. Sounds backwards; works well.
Query rewriting is one of those changes that’s easy to add and hard to evaluate by eye. The temptation is to ship it because it feels smart. Make sure the eval shows it’s actually helping — for simple corpuses and clear questions, it can degrade quality by introducing noise.
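The HyDE flow in particular is compact enough to sketch end to end. A toy version with stubbed-out LLM and embedding functions — every name here is a hypothetical stand-in, not a real API:

```python
def hyde_retrieve(question, index, generate, embed, top_k=5):
    """HyDE: embed a hypothetical *answer*, not the question, and
    retrieve with that vector. `generate`/`embed` are model stand-ins."""
    hypothetical = generate(f"Write a short passage answering: {question}")
    return index.search(embed(hypothetical), top_k)

class ToyIndex:
    """Exact dot-product search over a handful of docs."""
    def __init__(self, docs, embed):
        self.docs, self.embed = docs, embed
    def search(self, vec, top_k):
        dot = lambda a, b: sum(x * y for x, y in zip(a, b))
        return sorted(self.docs, key=lambda d: dot(self.embed(d), vec),
                      reverse=True)[:top_k]

# Toy "embedding": counts of three vocabulary words.
embed = lambda text: [text.count(w) for w in ("refund", "policy", "offsite")]
generate = lambda prompt: "Our refund policy allows refunds within 30 days."
index = ToyIndex(["refund policy details", "office holiday schedule"], embed)

# The raw question embeds to [0, 0, 0] -- pure noise. The hypothetical
# answer embeds to [2, 1, 0] and lands on the right chunk.
hits = hyde_retrieve("how do I get my money back?", index, generate, embed, top_k=1)
```

The toy vocabulary makes the failure mode concrete: the user’s phrasing shares no terms with the index, but the generated answer does.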
§ 04 · MULTI-STEP & AGENTIC RETRIEVAL · When one round isn’t enough
Some questions can’t be answered in a single retrieve-then-generate step:
- Multi-hop facts. “What’s the address of the firm that audits our biggest customer?” — needs two lookups, chained.
- Comparison. “Which of our products has the fastest deploy time?” — requires retrieving each product’s deploy details and comparing.
- Refining. The first retrieval returns a chunk that says “see appendix C.” A second retrieval finds appendix C.
The pattern is to let the LLM drive retrieval. The model receives the question, decides what to retrieve, looks at what came back, decides what to retrieve next, and finally generates an answer. This is one slice of what people mean when they say agentic RAG — rather than running one fixed retrieval step, the model can issue follow-up searches, call tools, or refine its query based on what it sees:
- Give the LLM a search(query) tool.
- Let it call search as many times as it needs, observing the results between calls.
- When it has enough, it generates the answer.
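That loop can be sketched with a scripted stand-in for the model — everything below is hypothetical; a real implementation would parse tool calls out of LLM output and cap the loop exactly as shown:

```python
def agentic_answer(question, llm_step, search, max_hops=4):
    """LLM-driven retrieval loop. `llm_step` stands in for the model:
    given the question and observations so far, it returns either
    ("search", query) or ("answer", text). Hops are capped so tail
    latency stays bounded."""
    observations = []
    for _ in range(max_hops + 1):  # +1 lets the model answer after the last hop
        action, payload = llm_step(question, observations)
        if action == "answer":
            return payload
        observations.append(search(payload))
    raise RuntimeError("hop budget exhausted without an answer")

# Hypothetical two-hop scenario: find the customer, then its auditor.
kb = {"biggest customer": "Acme Corp, audited by Smith & Co",
      "Smith & Co": "Smith & Co, 12 Main St"}
search = lambda q: kb.get(q, "no results")

def llm_step(question, observations):
    # Scripted version of the policy a real LLM would follow.
    if not observations:
        return ("search", "biggest customer")
    if len(observations) == 1:
        return ("search", "Smith & Co")
    return ("answer", observations[-1])

answer = agentic_answer("Address of the firm auditing our biggest customer?",
                        llm_step, search)
```

The `max_hops` cap is the one non-negotiable design choice: without it, a confused model can loop on searches indefinitely.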
Trade-offs are immediate: latency grows linearly with hop count, cost too, and bugs become harder to reproduce because the search trace can differ between runs. For most production systems, two-stage retrieval with reranking solves enough of the problem that you may never need multi-step. Pull this lever when you have specific evidence that single-step retrieval is blocking you.
§ 05 · PICKING WHAT TO ADD, AND WHEN · A rough priority order
If you’re starting from naive RAG, this is roughly the order to add things in, based on quality lift per unit of engineering pain:
- Evals first. A small set of golden questions, scored automatically. You cannot improve what you don’t measure.
- Reranking. Two-stage with a cross-encoder. Almost always the biggest single jump. Easy to add.
- Hybrid retrieval. BM25 + dense, fused with RRF. Especially useful for technical / structured content.
- Better chunking. Boring; high-impact. Semantic boundaries, parent-document metadata, smaller chunks for precision + larger ones for context (“parent document retrieval”).
- Query rewriting. Worth trying if your evals show short or ambiguous queries hurting accuracy.
- Multi-step / agentic. Reserve for when nothing else handles your question shape.
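The “parent document retrieval” idea from the chunking step can be sketched in a few lines: rank small chunks for precise matching, but hand the LLM the larger parent section each chunk came from. All names here are illustrative:

```python
def parent_doc_retrieve(query, chunks, parents, score, top_k=3):
    """Rank small chunks for precision, then return the larger parent
    sections they came from, de-duplicated (many chunks share a parent)."""
    ranked = sorted(chunks, key=lambda c: score(query, c["text"]), reverse=True)
    seen, out = set(), []
    for chunk in ranked:
        parent_id = parents[chunk["id"]]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parent_id)
        if len(out) == top_k:
            break
    return out

chunks = [{"id": "c1", "text": "refund within 30 days"},
          {"id": "c2", "text": "refund for annual plans"},
          {"id": "c3", "text": "holiday schedule"}]
parents = {"c1": "billing.md", "c2": "billing.md", "c3": "hr.md"}
overlap = lambda q, t: len(set(q.split()) & set(t.split()))
result = parent_doc_retrieve("refund policy", chunks, parents, overlap)
```

The de-duplication step matters: without it, several high-scoring chunks from the same section would crowd out everything else in the top-k.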
§ 06 · TAKING THIS FORWARD · Where to look next
Three frontier topics, in roughly increasing distance from current practice:
- Knowledge graph augmentation. Build an explicit graph of entities and relations alongside the vector index; let retrieval traverse it for multi-hop. Especially useful for legal, medical, and biotech corpuses where relations matter.
- Long-context models as a partial RAG replacement. 1M-token context windows let you put entire books into the prompt. Doesn’t make RAG obsolete (cost scales with context length), but changes the trade-off — you can be lazier about retrieval if you can afford the tokens.
- RAG + memory. A retrieval index over the assistant’s past conversations with the user. Personalization without a full fine-tune. The line between “RAG” and “agent memory” is getting blurry.
The honest summary: RAG is a moving target because what you can cheaply put in the context window is growing every six months, and what counts as a “good chunk” depends on the model. Build the boring parts well (chunking, evals, observability), keep the clever parts swappable.
§ · GOING DEEPER · Production-grade retrieval patterns
Three patterns compound. Two-stage retrieval: cheap recall (vector or hybrid) over the full corpus to find top-100, then a cross-encoder reranker scores those to find top-5. The reranker is much more expensive per query, but you only call it on the survivors. Query rewriting and HyDE (Gao et al. 2023): use the LLM to generate a hypothetical answer, then embed that, then retrieve — embedding answers against answers is more reliable than questions against answers.
Multi-step / agentic retrieval — give the LLM a search(query) tool and let it decide what to fetch next. For multi-hop questions (“what year was the company founded that bought X?”) you need this. FLARE (Jiang et al. 2023), IRCoT (Trivedi et al. 2023), and Self-RAG (Asai et al. 2023) are the canonical implementations. The catch: latency multiplies with hops. Cap loop iterations to keep tail latency bounded.
§ · FURTHER READINGReferences & deeper sources
- Khattab & Zaharia (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT · SIGIR
- Cormack, Clarke & Büttcher (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods · SIGIR
- Gao et al. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE) · ACL
- Asai et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection · ICLR
- Edge et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.