Advanced RAG
Naive RAG fails in five predictable ways. Each failure has a standard fix — reranking, hybrid search, query rewriting, multi-step retrieval. This is the tour of what production RAG actually looks like.
The five-bullet version
- Naive RAG retrieves the wrong chunk often enough that production systems need a second pass.
- Reranking: pull top-50 with cheap vector search, rerank to top-5 with a cross-encoder. Biggest single quality jump.
- Hybrid search: combine BM25 (lexical) with dense vectors. Each catches what the other misses.
- Query rewriting / HyDE: transform the user’s question into something the index will match better.
- Multi-step retrieval: let the model retrieve, read, then retrieve again with what it now knows.
§ 00 · WHERE NAIVE RAG RUNS OUT · Five failure modes you’ll meet
If you build the pipeline from the RAG lesson, hand it to real users, and watch the failures, you’ll see the same five patterns repeat:
- Right answer, wrong chunk. The chunk that contains the answer doesn’t make it into the top-k. The query and the answer are phrased differently enough that vector similarity alone misses.
- Exact terms don’t match. The user asks about “SOC 2 Type II” and the relevant chunk uses “Service Organization Controls 2 (Type II).” Semantic similarity is fine but not perfect; lexical search would have nailed it.
- Multi-hop questions. The user asks something that needs two or three retrievals chained — find document A, see it references topic B, retrieve about B, combine.
- Ambiguous queries. “What did we decide?” carries no nouns. The retrieval embedding is essentially noise.
- Long-tail topics. A topic mentioned in one chunk in a 100k-chunk corpus competes with a hundred near-misses. Top-5 retrieval drowns it.
Each fix below addresses one or more of these. The right combination depends on your data and queries — don’t add all of them prophylactically. Each one adds latency and complexity.
§ 01 · RERANK — THE 2-STAGE RETRIEVER · Retrieve cheap, rerank expensive
Embedding-based search is fast because it’s a single distance comparison per chunk. It’s also imperfect: a cosine similarity of 0.78 doesn’t reliably mean a better match than one scoring 0.72.
A cross-encoder reranker takes the query and a candidate chunk as joint input, in one forward pass, and outputs a single relevance score. It can’t be precomputed (the chunk and the query interact at every layer), so it’s slow. But, scored one pair at a time, it’s far more accurate than vector similarity.
The standard 2-stage pattern:
- First-stage retrieval: pull the top 50–100 chunks with cheap vector search (or hybrid, below).
- Second-stage rerank: run those 50 through a cross-encoder. Keep the top 5 by reranker score. Hand those to the LLM.
Common rerankers: Cohere Rerank, BAAI/bge-reranker, Voyage. Most are small enough to self-host. The quality jump from adding a reranker is usually the single largest in the entire RAG stack — 10–20 points on retrieval metrics like NDCG, even with the embedding model unchanged.
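The two-stage pattern is, structurally, just two sorts with different scoring functions. A minimal sketch — both scorers here are toy stand-ins; in production the first stage would be a vector index and the second a real cross-encoder such as bge-reranker or Cohere Rerank:

```python
def two_stage_retrieve(query, corpus, cheap_score, rerank_score,
                       first_k=50, final_k=5):
    """Stage 1: cheap score over the whole corpus, keep first_k.
    Stage 2: expensive score over the survivors only, keep final_k."""
    candidates = sorted(corpus, key=lambda c: cheap_score(query, c),
                        reverse=True)[:first_k]
    return sorted(candidates, key=lambda c: rerank_score(query, c),
                  reverse=True)[:final_k]

corpus = ["refund policy for annual plans",
          "office holiday schedule",
          "disputing a charge and getting a refund"]
# Toy scorers: word overlap stands in for vector search; an
# exact-phrase bonus stands in for the cross-encoder's joint scoring.
cheap = lambda q, c: len(set(q.split()) & set(c.split()))
rerank = lambda q, c: cheap(q, c) + (2 if q in c else 0)
top = two_stage_retrieve("refund policy", corpus, cheap, rerank,
                         first_k=2, final_k=1)
```

The shape is the point: the expensive scorer only ever sees `first_k` candidates, so its per-query cost is bounded no matter how large the corpus grows.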
§ 02 · HYBRID — SPARSE + DENSE · BM25 still earns its keep
Vector embeddings shine at paraphrases — when the query and the answer use different words for the same idea. They struggle when the query contains a rare term, a product name, a code identifier, or a number. The embedding fuzzes all of those into “there is a token here that looks code-y,” and dozens of near-miss chunks score almost identically.
BM25 — the lexical workhorse of the pre-deep-learning era — does the opposite: it counts how often query terms appear in a document, weighted inversely by how common those terms are in the corpus, so it scores high on exact-term overlap. Bad at paraphrases, great at “the user typed the exact phrase.” Old, simple, sturdy — and still competitive when exact-term matching matters. Modern hybrid retrieval runs both, scales the scores, and combines them — typically with a simple weighted sum or with Reciprocal Rank Fusion (RRF).
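The scoring function described above fits in a few lines. A toy version, simplified from the textbook formula (real implementations such as Lucene’s differ in details like IDF flooring):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Toy BM25: term frequency damped by k1, length-normalised by b,
    weighted by inverse document frequency across the corpus."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

# Documents are pre-tokenised term lists.
corpus = [["refund", "policy", "annual"], ["office", "holiday", "schedule"]]
```

Note what it cannot do: a query term that never appears in a document contributes exactly zero, no matter how close the meaning — which is precisely the gap dense retrieval fills.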
The combination is more than the sum of parts because BM25 catches the queries dense search misses, and vice versa. Concretely:
- A query like “refund policy” benefits from BM25 — exact words, common terminology.
- A query like “getting money back after dispute” benefits from dense — the doc says “refund” but the user didn’t.
- Both run, both scores get fused, the right chunk wins under either phrasing.
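RRF itself is small enough to show whole. A minimal pure-Python sketch — the rankings below are hypothetical, and `k=60` is the constant from the original RRF paper:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank) over
    every list that ranked d. Higher fused score first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings for the query "refund policy": BM25 nails the
# exact phrase; dense search prefers a paraphrase-heavy chunk.
bm25_top = ["refund-policy", "billing-faq", "office-holiday"]
dense_top = ["dispute-process", "refund-policy", "billing-faq"]
fused = rrf_fuse([bm25_top, dense_top])
# "refund-policy" wins: ranked high by both lists.
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.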
[An interactive demo here compared retrieval modes on the same query. With dense vector similarity only, retrieval picks chunks that are semantically near the query but misses exact-phrase matches: an “office holiday” chunk survives because it has semantic overlap with “offsite.” Under hybrid plus rerank it drops out — no lexical match, and the cross-encoder rejects it. Same query, very different answer depending on what you pulled.]
§ 03 · QUERY REWRITING & HYDE · Make the question match what’s indexed
Sometimes the user’s question is the problem. A short, ambiguous, or unusually phrased query produces a bad embedding. Better to rewrite the query before retrieval.
Three common rewriting patterns:
- Decomposition. “Compare our 2024 and 2025 cloud spend by region” → two retrievals (2024, 2025), each per-region. The model decomposes the question into parallel sub-queries, retrieves each, combines.
- Expansion. “ETA on Project Falcon?” → retrieval looks for “Falcon timeline”, “Falcon delivery”, “Falcon roadmap.” Cast a wider net by generating synonyms before searching.
- HyDE (hypothetical document embeddings). Ask the LLM to generate a fake answer to the question. Embed that fake answer (not the question). Use that vector for retrieval. The hypothetical answer is in the same “shape” as real answers, so it embeds closer to the right chunks than the question does. Sounds backwards; works well.
Query rewriting is one of those changes that’s easy to add and hard to evaluate by eye. The temptation is to ship it because it feels smart. Make sure the eval shows it’s actually helping — for simple corpuses and clear questions, it can degrade quality by introducing noise.
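The HyDE flow in particular is compact enough to sketch end to end. A toy version with stubbed-out LLM and embedding functions — every name here is a hypothetical stand-in, not a real API:

```python
def hyde_retrieve(question, index, generate, embed, top_k=5):
    """HyDE: embed a hypothetical *answer*, not the question, and
    retrieve with that vector. `generate`/`embed` are model stand-ins."""
    hypothetical = generate(f"Write a short passage answering: {question}")
    return index.search(embed(hypothetical), top_k)

class ToyIndex:
    """Exact dot-product search over a handful of docs."""
    def __init__(self, docs, embed):
        self.docs, self.embed = docs, embed
    def search(self, vec, top_k):
        dot = lambda a, b: sum(x * y for x, y in zip(a, b))
        return sorted(self.docs, key=lambda d: dot(self.embed(d), vec),
                      reverse=True)[:top_k]

# Toy "embedding": counts of three vocabulary words.
embed = lambda text: [text.count(w) for w in ("refund", "policy", "offsite")]
generate = lambda prompt: "Our refund policy allows refunds within 30 days."
index = ToyIndex(["refund policy details", "office holiday schedule"], embed)

# The raw question embeds to [0, 0, 0] -- pure noise. The hypothetical
# answer embeds to [2, 1, 0] and lands on the right chunk.
hits = hyde_retrieve("how do I get my money back?", index, generate, embed, top_k=1)
```

The toy vocabulary makes the failure mode concrete: the user’s phrasing shares no terms with the index, but the generated answer does.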
§ 04 · MULTI-STEP & AGENTIC RETRIEVAL · When one round isn’t enough
Some questions can’t be answered in a single retrieve-then-generate step:
- Multi-hop facts. “What’s the address of the firm that audits our biggest customer?” — needs two lookups, chained.
- Comparison. “Which of our products has the fastest deploy time?” — requires retrieving each product’s deploy details and comparing.
- Refining. The first retrieval returns a chunk that says “see appendix C.” A second retrieval finds appendix C.
The pattern is to let the LLM drive retrieval. The model receives the question, decides what to retrieve, looks at what came back, decides what to retrieve next, and finally generates an answer. This is one slice of what people mean when they say agentic RAG — rather than running one fixed retrieval step, the model can issue follow-up searches, call tools, or refine its query based on what it sees:
- Give the LLM a search(query) tool.
- Let it call search as many times as it needs, observing the results between calls.
- When it has enough, it generates the answer.
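That loop can be sketched with a scripted stand-in for the model — everything below is hypothetical; a real implementation would parse tool calls out of LLM output and cap the loop exactly as shown:

```python
def agentic_answer(question, llm_step, search, max_hops=4):
    """LLM-driven retrieval loop. `llm_step` stands in for the model:
    given the question and observations so far, it returns either
    ("search", query) or ("answer", text). Hops are capped so tail
    latency stays bounded."""
    observations = []
    for _ in range(max_hops + 1):  # +1 lets the model answer after the last hop
        action, payload = llm_step(question, observations)
        if action == "answer":
            return payload
        observations.append(search(payload))
    raise RuntimeError("hop budget exhausted without an answer")

# Hypothetical two-hop scenario: find the customer, then its auditor.
kb = {"biggest customer": "Acme Corp, audited by Smith & Co",
      "Smith & Co": "Smith & Co, 12 Main St"}
search = lambda q: kb.get(q, "no results")

def llm_step(question, observations):
    # Scripted version of the policy a real LLM would follow.
    if not observations:
        return ("search", "biggest customer")
    if len(observations) == 1:
        return ("search", "Smith & Co")
    return ("answer", observations[-1])

answer = agentic_answer("Address of the firm auditing our biggest customer?",
                        llm_step, search)
```

The `max_hops` cap is the one non-negotiable design choice: without it, a confused model can loop on searches indefinitely.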
Trade-offs are immediate: latency grows linearly with hop count, cost too, and bugs become harder to reproduce because the search trace can differ between runs. For most production systems, two-stage retrieval with reranking solves enough of the problem that you may never need multi-step. Pull this lever when you have specific evidence that single-step retrieval is blocking you.
§ 05 · PICKING WHAT TO ADD, AND WHEN · A rough priority order
If you’re starting from naive RAG, this is roughly the order to add things in, based on quality lift per unit of engineering pain:
- Evals first. A small set of golden questions, scored automatically. You cannot improve what you don’t measure.
- Reranking. Two-stage with a cross-encoder. Almost always the biggest single jump. Easy to add.
- Hybrid retrieval. BM25 + dense, fused with RRF. Especially useful for technical / structured content.
- Better chunking. Boring; high-impact. Semantic boundaries, parent-document metadata, smaller chunks for precision + larger ones for context (“parent document retrieval”).
- Query rewriting. Worth trying if your evals show short or ambiguous queries hurting accuracy.
- Multi-step / agentic. Reserve for when nothing else handles your question shape.
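The “parent document retrieval” idea from the chunking step can be sketched in a few lines: rank small chunks for precise matching, but hand the LLM the larger parent section each chunk came from. All names here are illustrative:

```python
def parent_doc_retrieve(query, chunks, parents, score, top_k=3):
    """Rank small chunks for precision, then return the larger parent
    sections they came from, de-duplicated (many chunks share a parent)."""
    ranked = sorted(chunks, key=lambda c: score(query, c["text"]), reverse=True)
    seen, out = set(), []
    for chunk in ranked:
        parent_id = parents[chunk["id"]]
        if parent_id not in seen:
            seen.add(parent_id)
            out.append(parent_id)
        if len(out) == top_k:
            break
    return out

chunks = [{"id": "c1", "text": "refund within 30 days"},
          {"id": "c2", "text": "refund for annual plans"},
          {"id": "c3", "text": "holiday schedule"}]
parents = {"c1": "billing.md", "c2": "billing.md", "c3": "hr.md"}
overlap = lambda q, t: len(set(q.split()) & set(t.split()))
result = parent_doc_retrieve("refund policy", chunks, parents, overlap)
```

The de-duplication step matters: without it, several high-scoring chunks from the same section would crowd out everything else in the top-k.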
§ 06 · TAKING THIS FORWARD · Where to look next
Three frontier topics, in roughly increasing distance from current practice:
- Knowledge graph augmentation. Build an explicit graph of entities and relations alongside the vector index; let retrieval traverse it for multi-hop. Especially useful for legal, medical, and biotech corpuses where relations matter.
- Long-context models as a partial RAG replacement. 1M-token context windows let you put entire books into the prompt. Doesn’t make RAG obsolete (cost scales with context length), but changes the trade-off — you can be lazier about retrieval if you can afford the tokens.
- RAG + memory. A retrieval index over the assistant’s past conversations with the user. Personalization without a full fine-tune. The line between “RAG” and “agent memory” is getting blurry.
The honest summary: RAG is a moving target because what you can cheaply put in the context window is growing every six months, and what counts as a “good chunk” depends on the model. Build the boring parts well (chunking, evals, observability), keep the clever parts swappable.
§ · GOING DEEPER · Production-grade retrieval patterns
Three patterns compound. Two-stage retrieval: cheap recall (vector or hybrid) over the full corpus to find top-100, then a cross-encoder reranker scores those to find top-5. The reranker is much more expensive per query, but you only call it on the survivors. Query rewriting and HyDE (Gao et al. 2023): use the LLM to generate a hypothetical answer, then embed that, then retrieve — embedding answers against answers is more reliable than questions against answers.
Multi-step / agentic retrieval — give the LLM a search(query) tool and let it decide what to fetch next. For multi-hop questions (“what year was the company founded that bought X?”) you need this. FLARE (Jiang et al. 2023), IRCoT (Trivedi et al. 2023), and Self-RAG (Asai et al. 2023) are the canonical implementations. The catch: latency multiplies with hops. Cap loop iterations to keep tail latency bounded.
§ · FURTHER READINGReferences & deeper sources
- Khattab & Zaharia (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT · SIGIR
- Cormack, Clarke & Büttcher (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods · SIGIR
- Gao et al. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE) · ACL
- Asai et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection · ICLR
- Edge et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.