Agentic Hybrid RAG
Two complementary improvements over naive RAG — hybrid sparse-dense retrieval, and an LLM-driven loop — stacked. The result is the configuration most modern production RAG systems land at.
The five-bullet version
- Hybrid retrieval combines BM25 (lexical) and dense vector search; agentic adds an LLM in a loop.
- Stacked together: hybrid retrieval gives a stronger first stage; the agent decides when to retrieve, with what query, against which index.
- Per-query, the agent can choose BM25-heavy (rare terms, IDs, code) or dense-heavy (paraphrases, semantic).
- Most production retrieval systems in 2026 look like this — hybrid first stage, optional rerank, agent loop on top.
- Cost grows: hybrid is ~2× the retrieval work; agentic adds multiple LLM calls per query.
§ 00 · TWO IDEAS, COMBINED
Hybrid + agentic, why both
We’ve covered both pieces separately: hybrid retrieval (BM25 + dense, fused via rank fusion or weighted sum; see Advanced RAG) and agentic RAG (the LLM iteratively deciding what to retrieve, observing results, and retrieving again; see the Agentic RAG lesson). They fix orthogonal problems:
- Hybrid fixes the first stage. Vector search misses exact terms; BM25 misses paraphrases. Combining them widens what a single retrieval can catch.
- Agentic fixes the loop. One retrieval often isn’t enough. Letting the LLM iterate handles multi-hop and refinement.
Stack them: hybrid retrieval as the building block, agentic loop as the controller. Each retrieval the agent issues is itself a hybrid search.
§ 01 · THE HYBRID LAYER
Quick refresher
For each query, run two retrievals in parallel:
- Dense. Embed the query, find top-k chunks by cosine similarity. Good at paraphrase.
- Sparse / BM25. Tokenize the query, score chunks by lexical overlap weighted by term-frequency / inverse-document-frequency. Good at exact terms, IDs, code, rare words.
Combine via Reciprocal Rank Fusion (RRF) or a weighted sum, then optionally rerank with a cross-encoder. RRF is a simple, parameter-free fusion: for each doc, compute Σ 1/(k + rank_i) across the retrievers. It is robust to retrievers having very different score scales.
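As a concrete sketch of the fusion step (the function name and the k = 60 default are illustrative, not tied to any particular library):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank_i))
    over the ranked lists it appears in (ranks are 1-based).
    Parameter-free apart from k; 60 is a commonly used default."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense_top = ["d3", "d1", "d7"]  # from vector search, best first
bm25_top = ["d1", "d9", "d3"]   # from BM25, best first
fused = rrf_fuse([dense_top, bm25_top])
# d1 ranks well in both lists, so it fuses to the top
```

Note that the raw similarity and BM25 scores are never compared directly; only ranks enter the formula, which is exactly why RRF tolerates mismatched score scales.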
§ 02 · THE AGENTIC LOOP ON TOP
Driving multiple hybrid retrievals
With hybrid retrieval as a primitive, the agent operates above:
- Receive user question.
- Decide what to retrieve. Issue a `hybrid_search(query)` call.
- Observe the top-k results.
- Have enough? Answer. Need more? Issue another `hybrid_search`.
- Loop with a step cap (typically 3–5).
Each retrieval is hybrid; the agent only sees the merged top-k. The agent doesn’t need to know whether dense or BM25 surfaced a given chunk — the fusion is invisible.
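The loop can be sketched in a few lines. Everything here is a stand-in: `llm_decide` and `hybrid_search` represent your model call and your fused retriever, and the decision dict format is an assumption for illustration, not a fixed API:

```python
MAX_STEPS = 4  # step cap from the loop above (typically 3-5)

def agentic_answer(question, llm_decide, hybrid_search):
    """Run hybrid retrievals until the model decides it can answer.
    llm_decide(question, context, force_answer=False) is assumed to return
    {"action": "answer", "text": ...} or {"action": "search", "query": ...}."""
    context, query = [], question
    for _ in range(MAX_STEPS):
        # Each call is dense + BM25, already fused; the agent only
        # sees the merged top-k, never which retriever surfaced what.
        context.extend(hybrid_search(query))
        decision = llm_decide(question, context)
        if decision["action"] == "answer":
            return decision["text"]
        query = decision["query"]  # reformulated follow-up search
    # Step cap hit: force a final answer from whatever was gathered
    return llm_decide(question, context, force_answer=True)["text"]
```

The cap matters: without it, a model that keeps reformulating can loop indefinitely on unanswerable questions.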
§ 03 · PER-QUERY ROUTING DECISIONS
Letting the agent choose its weapon
A modest extension: expose multiple retrieval tools, so the agent can pick which strategy fits each sub-query.
- `search_semantic(q)`: dense only. For paraphrases, conceptual queries.
- `search_lexical(q)`: BM25 only. For exact phrases, product names, error codes.
- `search_hybrid(q)`: both, fused. The default.
The agent learns to route: when the question says “what does error E-7321 mean,” lexical wins; when the question is “how do I roll back a release,” semantic wins; default otherwise to hybrid. A small system-prompt nudge teaches the routing.
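In the agentic setup the LLM itself makes this call, steered by the system prompt. Purely to make the decision surface concrete, here is a heuristic mirror of that routing (the regex patterns and tool names are illustrative assumptions):

```python
import re

def route(query: str) -> str:
    """Pick a retrieval tool the way the prompt nudges the agent to:
    exact identifiers -> lexical, how-to phrasing -> semantic,
    hybrid as the default."""
    if re.search(r"\b[A-Z]+-?\d{3,}\b", query) or '"' in query:
        return "search_lexical"   # error codes, IDs, quoted phrases
    if re.search(r"\bhow (do|can|to)\b", query, re.IGNORECASE):
        return "search_semantic"  # conceptual, paraphrase-heavy
    return "search_hybrid"        # the default

route("what does error E-7321 mean")     # -> "search_lexical"
route("how do I roll back a release")    # -> "search_semantic"
route("retry policy for batch exports")  # -> "search_hybrid"
```

A hard-coded router like this is brittle; the point of exposing three tools to the LLM is that the model generalizes the same routing intuition without you enumerating patterns.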
§ 04 · PRODUCTION SHAPE AND TRADE-OFFS
What this looks like at scale
Cost profile compared to vanilla RAG:
- Retrieval cost. Hybrid is roughly 2× a single retrieval. Cross-encoder rerank adds another ~10×, but only on the fused top-N.
- LLM cost. Vanilla RAG = 1 LLM call. Agentic hybrid = 1 + (loop iterations). For 80% of queries that finish in one iteration, the overhead is negligible. For the hard 20%, it’s 3–5×.
- Latency. Same shape. Most queries are fast; tail latency on multi-hop queries is the price you pay for handling them at all.
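Back-of-envelope on the LLM-call overhead, assuming the 80/20 split above and taking four calls as representative of the 3–5× band:

```python
# Average LLM calls per query vs. vanilla RAG's single call.
easy_share, hard_share = 0.8, 0.2
easy_calls, hard_calls = 1, 4   # 4 sits inside the 3-5x band
expected_calls = easy_share * easy_calls + hard_share * hard_calls
print(expected_calls)  # 1.6 -> ~1.6x vanilla on average, despite the 3-5x tail
```

The average stays modest because the multiplier only applies to the hard tail; the same logic holds for latency.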
When this is the right choice:
- Heterogeneous corpus (docs, code, structured data).
- Varied query shapes (lookup, multi-hop, conceptual).
- Quality matters more than raw latency for the hardest queries.
When to skip:
- Tight latency budget, simple queries — naive or hybrid alone wins.
- Pure exact-phrase corpus (codebase search) — BM25 alone often suffices.
- Cost-constrained, high-volume — the loop multiplies LLM bills.
§ 05 · TAKING THIS FORWARD
Where this is going
The current frontier is making the loop cheaper (smaller, faster models for the routing/decision steps; cached intermediate retrievals) and the retrieval smarter (learned fusion weights, per-query embedding model selection). The architectural shape is stable; the engineering keeps improving.
§ · GOING DEEPER
Hybrid retrieval and when to add the agent on top
Hybrid retrieval = sparse + dense, fused. BM25 (Robertson & Zaragoza 2009) is a strong baseline for queries with exact terms — error codes, product names, acronyms. Dense vectors handle paraphrase. Fusing the two via Reciprocal Rank Fusion (RRF) or weighted score combination consistently beats either alone. This is the pattern almost every production search system uses in 2026.
Adding an agent loop on top buys you two things: multi-step questions and query reformulation. The model can run a cheap hybrid search, observe what came back, decide a sub-question wasn’t answered, and run a different search. Costs latency but unlocks question shapes the one-shot pipeline can’t handle. Worth the cost when the eval shows multi-hop or cross-document queries dominating the tail.
§ · FURTHER READING
References & deeper sources
- Robertson, S. & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond · Foundations and Trends in IR
- Bruch, S., Gai, S. & Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval · ACM TOIS
- Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering · EMNLP
- Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models · ICLR
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.