Agentic RAG
Standard RAG retrieves once, then generates. Agentic RAG lets the model drive — decide what to look up, look at the result, decide what to look up next. Multi-hop questions, query refinement, and tool use beyond search all open up.
The five-bullet version
- Naive RAG retrieves once and answers. Multi-hop or refining queries break this pattern.
- Agentic RAG puts the LLM in a loop with a search() tool — it decides what to retrieve, observes, decides what's next.
- The agent can decompose questions, route to different indices, and combine results from multiple searches.
- Tool use generalizes beyond search — call APIs, run code, query databases as part of the same loop.
- Pay attention to latency and cost: every loop step adds an LLM call.
§ 00 · WHAT NAIVE RAG CAN’T DO
Single-shot retrieval’s limits
Naive RAG embeds the user query, retrieves the closest chunks, hands them to the LLM. One round trip, one answer. Many real questions don’t fit that shape.
Three failure shapes:
- Multi-hop. “Who is the CFO of the company that bought our biggest competitor last year?” — needs lookup A (find the acquisition), then lookup B (find the acquirer’s CFO).
- Decomposition. “Compare our cloud spend in AWS, GCP, and Azure for Q3” — three separate retrievals, one per cloud, then a synthesis.
- Refinement. First retrieval returns “see appendix B for details”; a follow-up retrieval is needed.
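To make the baseline concrete, here is a minimal sketch of the single round trip naive RAG performs — and cannot repeat. The `embed`, `index`, and `llm` arguments are hypothetical stand-ins for your embedding model, vector store, and language model, not any specific library's API.

```python
# One-shot RAG sketch: embed -> retrieve -> generate. No second chance
# to refine the query or follow a reference the chunks point at.
# `embed`, `index`, and `llm` are hypothetical stand-ins.

def naive_rag(question, embed, index, llm, k=4):
    """Single round trip: retrieve once, then answer."""
    chunks = index.search(embed(question), top_k=k)
    context = "\n\n".join(c.text for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```

If the retrieved chunks only contain half the answer (the acquisition but not the CFO), the model has no way to go back for the missing half.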
§ 01 · AGENTIC = LLM IN THE RETRIEVAL LOOP
The model drives
Agentic RAG is a RAG variant where the LLM iteratively decides what to retrieve: the model receives the user query, decides on a search, observes the results, and decides what to retrieve next, until it has enough to answer. It treats retrieval as a tool the model calls multiple times in a loop:
- Receive user question.
- Decide what to search for first.
- Call search(query).
- Read the results.
- Decide: have enough? Answer. Need more? Pick the next search.
- Loop until done or step cap reached.
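The steps above can be sketched as a short loop. This is a minimal illustration, assuming an `llm` that reads the transcript and replies with either `SEARCH: <query>` or `ANSWER: <text>`, and a `search` tool that returns retrieved chunks as a string — both are hypothetical stand-ins, not a real API.

```python
# Minimal agentic RAG loop. `llm` and `search` are hypothetical:
# `llm` replies "SEARCH: <query>" or "ANSWER: <text>";
# `search` returns a string of retrieved chunks.

def agentic_rag(question, llm, search, max_steps=6):
    transcript = f"Question: {question}"
    for _ in range(max_steps):            # step cap bounds latency and cost
        action = llm(transcript)
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        query = action[len("SEARCH:"):].strip()
        results = search(query)           # tool call
        transcript += f"\nSearched: {query}\nResults: {results}"
    # Step cap reached: force a best-effort answer from what was gathered.
    return llm(transcript + "\nAnswer with what you have. ANSWER:")
```

The step cap is the important design choice: without it, a confused model can loop indefinitely, and every extra iteration is another LLM call.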
§ 02 · QUERY ROUTING & DECOMPOSITION
Smarter retrieval setup
Two specific patterns inside the loop:
- Routing. Different indices for different topics — one for product docs, one for sales, one for code. The model picks which index to query based on the question. A short classifier (LLM or fine-tuned BERT) routes each query to the right source.
- Decomposition. A complex question is broken into sub-questions, each retrieved separately, results combined. The model can do this explicitly (“I need to look up A and B and C”) or implicitly via successive loop steps.
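Both patterns can be sketched in a few lines. Here `classify` stands in for the short classifier (an LLM call or a small fine-tuned model), and the index names are hypothetical examples matching the ones above.

```python
# Routing + decomposition sketch. `classify` and the indices are
# hypothetical stand-ins; `classify` returns an index name.

def route(question, classify, indices):
    """Send the query to the index the classifier names."""
    source = classify(question)            # e.g. "product", "sales", "code"
    return indices[source].search(question)

def decompose_and_retrieve(sub_questions, classify, indices):
    """Retrieve each sub-question separately; caller combines results."""
    return {q: route(q, classify, indices) for q in sub_questions}
```

Routing keeps each index small and topically clean; decomposition turns one hard query into several easy ones.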
§ 03 · TOOL USE BEYOND SEARCH
Retrieval is one tool among several
Once the model is in a tool-calling loop, retrieval is just one tool. Add others:
- sql(query) — for live data.
- python(code) — for calculations and chart generation.
- api.crm.lookup(id) — direct fetch from systems of record.
- web.search(query) — for current events.
The model decides which tool fits each step. Retrieval becomes one option in a richer action space. The pattern blends RAG with general agentic behavior — see the Agentic Patterns lesson.
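One way to picture this richer action space is a tool registry the loop dispatches into. A minimal sketch, with stub implementations standing in for real tools; the tool names mirror the examples above and are purely illustrative.

```python
# Tool-dispatch sketch: retrieval is one entry among several.
# All implementations here are hypothetical stubs.

def run_step(llm_choice, registry):
    """Execute one (tool_name, argument) step chosen by the model."""
    name, arg = llm_choice
    if name not in registry:
        raise ValueError(f"unknown tool: {name}")
    return registry[name](arg)

registry = {
    "search": lambda q: f"chunks for {q!r}",   # vector retrieval
    "sql": lambda q: f"rows for {q!r}",        # live data
    "python": lambda code: f"ran {code!r}",    # calculations
}
```

The loop itself doesn't change when tools are added — only the registry grows, which is what makes the pattern compose with general agentic behavior.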
§ 04 · WHEN AGENTIC EARNS ITS COST
Latency vs flexibility
Every loop iteration is at least one LLM call plus one tool call. Naive RAG is one round-trip; agentic might be 5–10. Latency multiplies; token spend multiplies. Use agentic RAG when:
- The query shapes are heterogeneous — some need one lookup, others need five. Naive RAG can’t adapt; agentic can.
- Multi-hop is common in your domain.
- You have multiple data sources that can’t reasonably be merged into one index (different shapes, different update frequencies, different access controls).
Skip agentic RAG when:
- The vast majority of queries are answerable from one retrieval.
- Latency budget is tight (< 1s end-to-end).
- Cost matters — adding loop iterations is the most expensive dimension to scale.
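The multiplication is worth doing on paper before committing. A back-of-envelope sketch — every number here is an illustrative assumption, not a measurement:

```python
# Back-of-envelope latency and cost for an N-step loop.
# All defaults are illustrative assumptions.

def loop_cost(steps, llm_latency_s=1.2, tool_latency_s=0.3,
              tokens_per_step=2000, usd_per_1k_tokens=0.01):
    latency = steps * (llm_latency_s + tool_latency_s)
    cost = steps * tokens_per_step / 1000 * usd_per_1k_tokens
    return latency, cost

naive = loop_cost(steps=1)     # one round trip
agentic = loop_cost(steps=8)   # a typical multi-hop run
```

Under these assumptions an 8-step run is 8x the latency and spend of one-shot RAG — the loop has to earn that by answering questions one-shot cannot.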
§ 05 · TAKING THIS FORWARD
Adjacent variants to know
Next: Agentic Hybrid RAG (combines agentic looping with hybrid sparse-dense retrieval) and HyPA-RAG (a hybrid parameter-adaptive variant). The space is moving fast — the patterns above are durable; the labels are not.
§ · GOING DEEPER
When to upgrade from one-shot RAG to an agent
Agentic RAG places the LLM in a loop with retrieval as a tool the model can call multiple times. ReAct (Yao et al. 2022) defined the pattern: alternate Thought tokens with Action (tool call) tokens, observe results, decide what to do next. For single-hop factual queries, one-shot retrieval still wins on latency and cost; for multi-hop or comparative questions, the loop is the only thing that works.
Two follow-ups in the literature improve the basic loop. FLARE (Jiang et al. 2023) decides when to retrieve by watching the model’s token confidence — it only triggers retrieval when the next token is uncertain. Self-RAG (Asai et al. 2023) trains the model to emit special tokens marking when to retrieve and when to critique its own output. Both cut unnecessary retrievals without sacrificing recall on the questions that need them.
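The FLARE-style trigger reduces to a small predicate. A sketch, assuming `probs` is a list of per-token probabilities from a draft generation (how you obtain those depends entirely on your serving stack):

```python
# FLARE-style retrieval gate: retrieve only when the draft contains
# a low-confidence token. `probs` is a hypothetical list of per-token
# probabilities from the draft generation.

def should_retrieve(probs, threshold=0.6):
    """Trigger retrieval if any token in the draft is low-confidence."""
    return min(probs) < threshold
```

High-confidence spans are kept as-is; only uncertain spans pay the cost of a retrieval round trip, which is where the savings come from.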
§ · FURTHER READING
References & deeper sources
- Yao et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models · ICLR
- Schick et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools · NeurIPS
- Asai et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection · ICLR
- Jiang et al. (2023). Active Retrieval Augmented Generation (FLARE) · EMNLP
- Trivedi et al. (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT) · ACL
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.