One-Line Summary: Query decomposition breaks complex user queries into simpler sub-queries that can each be answered through targeted retrieval, while multi-step retrieval iteratively retrieves information where each step's findings inform the next -- together enabling RAG systems to answer complex, multi-faceted, and multi-hop questions that single-shot retrieval fundamentally cannot handle.

Prerequisites: Understanding of standard RAG retrieval (single query, single retrieval, single generation), the limitations of single-shot retrieval for complex questions, basic prompt engineering, the concept of chain-of-thought reasoning, and familiarity with agentic RAG patterns.

What Is Query Decomposition?

Most real-world questions are not simple. "Compare the environmental policies of the EU and China in terms of carbon pricing, renewable energy targets, and enforcement mechanisms" requires at least six separate pieces of information (EU carbon pricing, China carbon pricing, EU renewable targets, China renewable targets, EU enforcement, China enforcement). No single text chunk contains all of this. Standard RAG, which embeds the full query and retrieves the top-k most similar chunks, will likely return a mishmash of partially relevant documents that cover some aspects but not others.

flowchart LR
    S1["decompose into sub-queries"]
    S2["retrieve for each"]
    S3["synthesize answer"]
    S1 --> S2
    S2 --> S3

Query decomposition explicitly breaks complex queries into atomic sub-queries, each simple enough to be answered by a focused retrieval step. The results are then synthesized into a comprehensive answer. This is the retrieval equivalent of chain-of-thought reasoning: instead of trying to solve a complex problem in one step, break it into manageable pieces.

Multi-step retrieval extends this further: the answer to one sub-query informs the formulation of the next. This is essential for multi-hop questions, where you cannot know what to ask next until you have the answer to the current question. "Which company acquired the startup that developed the technology used in iPhone's Face ID?" requires first identifying the technology (structured-light 3D sensing via the TrueDepth camera), then the company that developed it (PrimeSense), then the company that acquired PrimeSense (Apple, in 2013). Each hop depends on the answer to the hop before it.

How It Works

flowchart LR
    S1["IRCoT (Interleaving Retrieval with Chain of Thought)"]
    S2["alternating reasoning and retrieval steps"]
    S1 --> S2

Technique 1: Query Decomposition (Parallel Sub-Queries)

The simplest form of decomposition breaks a complex query into independent sub-queries that can be executed in parallel.

Implementation:

Original query: "Compare the revenue models and market share of Uber and Lyft"
 
LLM Decomposition Step:
  Sub-query 1: "What is Uber's revenue model?"
  Sub-query 2: "What is Lyft's revenue model?"
  Sub-query 3: "What is Uber's current market share in ride-hailing?"
  Sub-query 4: "What is Lyft's current market share in ride-hailing?"
 
Parallel Retrieval:
  Retrieve top-k for each sub-query independently
 
Synthesis:
  Pass all retrieved chunks + original query to LLM for comparative answer
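
A minimal sketch of this pipeline in Python, assuming hypothetical llm() and retrieve() helpers as stand-ins for your LLM client and vector store (later sketches in this section reuse the same helpers):

import asyncio

async def llm(prompt: str) -> str:
    """Hypothetical LLM completion call -- wire up your own client here."""
    raise NotImplementedError

async def retrieve(query: str, k: int = 5) -> list[str]:
    """Hypothetical vector search returning the top-k chunk texts."""
    raise NotImplementedError

DECOMPOSE_PROMPT = (
    "Break the question below into 2-6 atomic sub-questions, one per line, "
    "each answerable by a single retrieval step.\n"
    "Question: {query}\nSub-questions:"
)

async def decompose_and_answer(query: str) -> str:
    # 1. Decompose: one LLM call yields newline-separated sub-queries.
    raw = await llm(DECOMPOSE_PROMPT.format(query=query))
    sub_queries = [line.strip() for line in raw.splitlines() if line.strip()]

    # 2. Parallel retrieval: the sub-queries are independent, so run them
    #    concurrently; total latency is that of the slowest one.
    results = await asyncio.gather(*(retrieve(sq) for sq in sub_queries))

    # 3. Synthesis: pass all retrieved chunks plus the original query back in.
    context = "\n\n".join(chunk for chunks in results for chunk in chunks)
    return await llm(f"Context:\n{context}\n\nAnswer this question: {query}")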

LlamaIndex implementation: The SubQuestionQueryEngine automates this pattern. Given a query and multiple data sources (indexes), it decomposes the query into sub-questions, routes each to the appropriate index, and synthesizes the results.
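
A hedged sketch of the LlamaIndex pattern (import paths follow recent llama_index.core releases and may differ in your version; the data directories are placeholders):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# One index (and query engine) per data source; paths are placeholders.
uber_docs = SimpleDirectoryReader("data/uber").load_data()
lyft_docs = SimpleDirectoryReader("data/lyft").load_data()
uber_engine = VectorStoreIndex.from_documents(uber_docs).as_query_engine()
lyft_engine = VectorStoreIndex.from_documents(lyft_docs).as_query_engine()

tools = [
    QueryEngineTool(
        query_engine=uber_engine,
        metadata=ToolMetadata(name="uber", description="Uber filings and reports"),
    ),
    QueryEngineTool(
        query_engine=lyft_engine,
        metadata=ToolMetadata(name="lyft", description="Lyft filings and reports"),
    ),
]

# Decomposes the query, routes each sub-question to the matching tool,
# and synthesizes the results (uses the globally configured LLM).
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = engine.query(
    "Compare the revenue models and market share of Uber and Lyft"
)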

DSPy implementation: DSPy's MultiChainComparison module compares the outputs of multiple reasoning chains and synthesizes a refined answer; paired with per-sub-query retrieval, it can aggregate the results of a decomposition pipeline.

Key design decisions:

  • Who does the decomposition? Typically an LLM, prompted to break the query into atomic questions. The prompt can include few-shot examples of good decompositions (see the example prompt after this list).
  • How many sub-queries? Too few miss important aspects. Too many increase latency and cost. Typical: 2-6 sub-queries for most complex questions.
  • Independence assumption: Parallel decomposition assumes sub-queries are independent. If they are not (if the answer to one informs the next), multi-step retrieval is needed instead.
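
To make the first point concrete, a decomposition prompt with one few-shot example might look like the following (the wording and example are illustrative, not canonical):

Break the user's question into 2-6 atomic sub-questions, one per line.
Each sub-question must be answerable from a single document.

Example:
  Question: "Compare the founding years and founders of Google and Amazon"
  Sub-questions:
    What year was Google founded?
    Who founded Google?
    What year was Amazon founded?
    Who founded Amazon?

Question: "{user_query}"
Sub-questions: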

Technique 2: Step-Back Prompting

Introduced by Zheng et al. (2023, Google DeepMind), step-back prompting generates a broader, more abstract version of the query before retrieval, retrieving general background knowledge that helps answer the specific question.

How it works:

Original query: "What happens to the pressure of an ideal gas if the
temperature is increased by a factor of 2 and the volume is increased
by a factor of 8?"
 
Step-back question (generated by LLM): "What is the ideal gas law and
how do pressure, volume, and temperature relate?"
 
Retrieval: Retrieve for both the original query AND the step-back query.
 
Generation: LLM answers the specific question using both the background
context (ideal gas law: PV = nRT) and any specific retrieved passages.

Why it works: Many specific questions require background knowledge that the specific query does not retrieve well. The step-back question retrieves foundational knowledge, and the original query retrieves specific details. Together, they provide the LLM with both the principles and the specifics needed to reason through the answer.
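
A minimal sketch, reusing the hypothetical llm()/retrieve() helpers (and asyncio import) from the Technique 1 sketch; the prompt wording is illustrative:

STEP_BACK_PROMPT = (
    "Given the question below, write one broader question about the "
    "underlying principle or concept needed to answer it.\n"
    "Question: {query}\nStep-back question:"
)

async def step_back_answer(query: str) -> str:
    # Abstract first: derive the step-back question from the original.
    step_back_q = (await llm(STEP_BACK_PROMPT.format(query=query))).strip()

    # Retrieve for BOTH the original and the step-back question.
    specific, background = await asyncio.gather(
        retrieve(query), retrieve(step_back_q)
    )

    # Principles first, specifics second, then let the LLM reason.
    context = "\n\n".join(background + specific)
    return await llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")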

Benchmark results from Zheng et al. (2023):

  • MMLU (physics, chemistry): Step-back prompting with PaLM-2L improved accuracy by 7-11% over standard prompting.
  • TimeQA (temporal reasoning): Step-back prompting improved accuracy from 56.4% to 68.3%.
  • MuSiQue (multi-hop QA): Improved accuracy by 5-7%.

The improvements are most pronounced on questions requiring reasoning from first principles or multi-step logic.

Technique 3: Multi-Hop Retrieval (Sequential/Iterative)

Multi-hop retrieval handles questions where each retrieval step depends on the results of the previous step. This is necessary for questions that require "following a chain" of information.

Example:

Query: "What university did the CEO of the company that acquired
Instagram attend?"
 
Step 1 - Retrieve: "Which company acquired Instagram?"
         Result: Facebook (Meta) acquired Instagram in 2012
 
Step 2 - Retrieve: "Who is the CEO of Facebook/Meta?"
         Result: Mark Zuckerberg is the CEO
 
Step 3 - Retrieve: "What university did Mark Zuckerberg attend?"
         Result: Harvard University
 
Final answer: Harvard University

Each step's query could not have been formulated without the previous step's answer. This is fundamentally different from parallel decomposition.
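
The dependency becomes explicit as a sequential loop. A minimal sketch with the same hypothetical helpers; the DONE stop convention and the max_hops cap are illustrative choices, not a standard:

NEXT_QUERY_PROMPT = (
    "Original question: {query}\n"
    "Findings so far:\n{findings}\n"
    "If the original question can now be answered, reply DONE. "
    "Otherwise write the single next search query."
)

async def multi_hop_answer(query: str, max_hops: int = 4) -> str:
    findings: list[str] = []
    for _ in range(max_hops):
        # Formulate the next hop's query from the accumulated findings.
        nxt = (await llm(NEXT_QUERY_PROMPT.format(
            query=query, findings="\n".join(findings) or "(none)"
        ))).strip()
        if nxt.upper() == "DONE":
            break
        chunks = await retrieve(nxt, k=3)
        findings.append(f"Q: {nxt}\nEvidence: {' '.join(chunks)}")

    joined = "\n".join(findings)
    return await llm(f"Findings:\n{joined}\n\nAnswer the question: {query}")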

Key approaches to multi-hop retrieval:

IRCoT (Interleaving Retrieval with Chain-of-Thought): Trivedi et al. (2023) proposed interleaving retrieval steps with chain-of-thought reasoning. The LLM generates a reasoning step, retrieves evidence for it, generates the next reasoning step informed by the evidence, retrieves again, and so on. This tightly couples reasoning with retrieval.
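
A compressed sketch of the IRCoT loop (same hypothetical helpers; the paper's actual prompting and paragraph-level retrieval are more involved, so treat this as the shape of the idea):

async def ircot_answer(query: str, max_steps: int = 4) -> str:
    evidence: list[str] = []
    thoughts: list[str] = []
    for _ in range(max_steps):
        # Generate the next reasoning sentence given the evidence so far...
        thought = (await llm(
            f"Question: {query}\nEvidence: {' '.join(evidence)}\n"
            f"Reasoning so far: {' '.join(thoughts)}\n"
            "Write the next single reasoning sentence; finish with "
            "'So the answer is ...' when done:"
        )).strip()
        thoughts.append(thought)
        if "answer is" in thought.lower():
            break
        # ...then use that sentence itself as the next retrieval query.
        evidence.extend(await retrieve(thought, k=2))
    return thoughts[-1]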

ITER-RETGEN (Iterative Retrieval-Generation): Shao et al. (2023) proposed a method that alternates between retrieval and generation. Each generation step produces text that becomes the query for the next retrieval step. The process iterates until the generation converges on a complete answer.

ReAct-Style Retrieval: Using the ReAct framework (Yao et al., 2023), the LLM generates explicit reasoning traces and retrieval actions:

Thought: I need to find who acquired Instagram.
Action: search("Instagram acquisition")
Observation: Facebook acquired Instagram for $1B in 2012.
Thought: Now I need to find the CEO of Facebook.
Action: search("CEO of Facebook Meta")
Observation: Mark Zuckerberg has been CEO of Facebook/Meta since founding.
Thought: Now I need to find his university.
Action: search("Mark Zuckerberg university education")
Observation: Zuckerberg attended Harvard University.
Thought: I have all the information needed.
Answer: Mark Zuckerberg attended Harvard University.

Technique 4: Query Expansion and Reformulation

Rather than decomposing into sub-queries, query expansion generates multiple reformulations of the same query to improve retrieval recall.

Multi-query generation: An LLM generates 3-5 alternative phrasings of the original query:

Original: "How do neural networks learn?"
Variant 1: "What is the training process for neural networks?"
Variant 2: "How does backpropagation enable learning in neural networks?"
Variant 3: "What mechanisms allow neural networks to improve from data?"

Each variant retrieves documents independently, and the results are merged (via reciprocal rank fusion or deduplication). Different phrasings activate different regions of the embedding space, improving recall.

RAG-Fusion: Raudaschl (2023) formalized this as RAG-Fusion: generate multiple query variants, retrieve for each, merge results with reciprocal rank fusion, then generate the final answer from the merged results. Simple but surprisingly effective -- typically improves recall by 10-20%.
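
Reciprocal rank fusion itself is only a few lines. A sketch of the merge step; k=60 is the smoothing constant conventionally used in RRF, and the document IDs are placeholders:

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1/(k + rank); high ranks dominate.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fusing the per-variant result lists also deduplicates, since scores
# are keyed by document ID.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],  # results for variant 1
    ["doc1", "doc4", "doc3"],  # results for variant 2
    ["doc1", "doc7", "doc9"],  # results for variant 3
])  # -> "doc1" first: it ranks highly in all three lists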

Technique 5: Least-to-Most Prompting for Decomposition

Zhou et al. (2023, Google Research) introduced least-to-most prompting, where complex problems are decomposed from easiest to hardest, and each sub-problem is solved in order with the solutions to easier problems available as context.

Applied to retrieval, this means:

  1. Decompose the query into sub-questions, ordered from simplest to most complex.
  2. Retrieve and answer the simplest sub-question first.
  3. Include that answer in the context when retrieving for the next sub-question.
  4. Continue until all sub-questions are answered.
  5. Synthesize the final answer.

This is particularly effective for questions requiring cumulative reasoning.
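
A sketch of that loop with the same hypothetical helpers, delegating the easiest-first ordering to the decomposition call:

async def least_to_most_answer(query: str) -> str:
    raw = await llm(
        "Decompose this question into sub-questions ordered from simplest "
        f"to most complex, one per line:\n{query}"
    )
    sub_questions = [line.strip() for line in raw.splitlines() if line.strip()]

    qa_pairs: list[str] = []
    for sq in sub_questions:
        chunks = await retrieve(sq)
        context = "\n".join(chunks)
        prior = "\n".join(qa_pairs) or "(none)"
        # Earlier answers ride along as context for later sub-questions.
        answer = await llm(
            f"Previously established:\n{prior}\n\nContext:\n{context}\n\n"
            f"Answer: {sq}"
        )
        qa_pairs.append(f"Q: {sq}\nA: {answer}")

    history = "\n".join(qa_pairs)
    return await llm(f"Given:\n{history}\n\nAnswer the original question: {query}")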

The Connection to Agentic RAG

Query decomposition and multi-step retrieval are the core retrieval patterns that define agentic RAG. The progression is:

  1. Naive RAG: Single query, single retrieval, single generation.
  2. Advanced RAG with decomposition: LLM decomposes query before retrieval, but the pipeline is still deterministic.
  3. Agentic RAG: The LLM dynamically decides whether to decompose, how many steps to take, when to retrieve, and when to stop -- making all decisions adaptively based on intermediate results.

In practice, query decomposition is often the first step in building an agentic RAG system. You start with a deterministic decomposition pipeline, then add adaptive decision-making as you discover which queries benefit from it.

Why It Matters

Complex queries are the norm, not the exception: In production RAG systems, 40-60% of user queries are multi-faceted, requiring information from multiple documents or topics. Single-shot retrieval handles only the simpler remainder; query decomposition extends RAG to the full distribution of real queries.

Multi-hop reasoning is required for expert-level QA: Domain-expert questions (legal research, medical diagnosis, financial analysis) almost always require following chains of information. "What are the tax implications for a Delaware LLC with partners in California and New York?" requires understanding Delaware LLC law, California income tax rules, New York income tax rules, and how they interact. No single retrieval step will find all of this.

Retrieval recall is the bottleneck: The most common RAG failure mode is that the relevant information is not in the retrieved set. Decomposing queries into focused sub-queries, each targeting a specific piece of information, dramatically improves recall for complex questions.

Composability with other techniques: Query decomposition can be combined with virtually every other RAG enhancement:

  • HyDE can be applied to each sub-query
  • Cross-encoder reranking can refine results for each sub-query
  • RAPTOR's hierarchical index can provide different levels of detail for different sub-queries
  • Self-RAG's reflection tokens can evaluate whether each sub-query was answered satisfactorily

Key Technical Details

  • Decomposition quality: The quality of the decomposition directly determines retrieval effectiveness. Poor decomposition (too vague, too many sub-queries, missing important aspects) leads to poor retrieval. Few-shot examples in the decomposition prompt are essential for consistent quality.

  • Latency: Each retrieval step adds 200-500ms (embedding + vector search + optional reranking). A 4-step multi-hop retrieval adds 1-2 seconds on top of the generation time. Parallel sub-queries can be executed concurrently, reducing total latency to that of the slowest sub-query. Sequential (multi-hop) queries cannot be parallelized.

  • Context window management: With multiple sub-queries, the total retrieved context can easily exceed the LLM's context window or the optimal context length. Strategies include: summarizing intermediate results, selecting only the most relevant chunks per sub-query, and using map-reduce synthesis (answer each sub-query independently, then synthesize).

  • When NOT to decompose: Simple factual queries ("What is the capital of France?") should not be decomposed. Unnecessary decomposition adds latency and can actually hurt performance by fragmenting a naturally coherent query. A classifier or the LLM itself should decide whether decomposition is needed (see the gating sketch after this list).

  • Decomposition + routing: In multi-index systems, decomposition naturally enables routing. Each sub-query can be directed to the most appropriate index or data source. A financial question might route sub-queries about revenue to the financial database and sub-queries about market trends to a news index.

  • Evaluation: Evaluating decomposition quality is challenging. Metrics include: sub-query coverage (do the sub-queries cover all aspects of the original query?), sub-query independence (are the sub-queries answerable independently?), and final answer quality (does the synthesized answer fully address the original query?).

  • Cost: Query decomposition multiplies the number of retrieval and LLM calls. A query decomposed into 4 sub-queries requires 4 retrieval operations, potentially 4 reranking steps, and at minimum 2 LLM calls (one for decomposition, one for synthesis -- often more). For cost-sensitive applications, decomposition should be applied selectively.
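
As flagged in the "When NOT to decompose" point above, a cheap gating call can make that decision. A minimal sketch, reusing the earlier hypothetical helpers and the Technique 1 decompose_and_answer(); the YES/NO convention is illustrative:

async def answer(query: str) -> str:
    verdict = await llm(
        "Does answering this question require combining information from "
        "multiple distinct topics, entities, or documents? Reply YES or NO.\n"
        f"Question: {query}"
    )
    if verdict.strip().upper().startswith("YES"):
        # Multi-faceted: route through the decomposition pipeline.
        return await decompose_and_answer(query)
    # Simple, focused query: plain single-shot RAG.
    chunks = await retrieve(query)
    context = "\n\n".join(chunks)
    return await llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")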

Common Misconceptions

"Query decomposition always improves results." For simple, focused queries, decomposition adds unnecessary complexity and can actually hurt by fragmenting a query that was best handled as a whole. The key is adaptive decomposition -- only decomposing when the query is genuinely multi-faceted.

"More sub-queries are always better." Diminishing returns set in quickly. Beyond 4-6 sub-queries, the additional sub-queries typically overlap with existing ones or are too granular to retrieve meaningful results. The synthesis step also becomes harder with more sub-query results to integrate.

"Multi-hop retrieval is the same as multi-turn conversation." Multi-hop retrieval is within a single user interaction -- the system internally performs multiple retrieval steps to answer one question. Multi-turn conversation involves the user providing additional information or follow-up questions across turns. They are different mechanisms, though multi-turn context can inform decomposition.

"Step-back prompting is just asking a simpler question." Step-back prompting specifically asks for the underlying principle or concept behind the specific question. It is not simplification -- it is abstraction. The step-back question is often harder to answer than the original but provides the foundational knowledge needed.

"Query decomposition replaces good chunking." Decomposition helps retrieve the right chunks, but if the chunks themselves are poorly constructed (splitting sentences, missing context), even perfect sub-queries will retrieve poor content. Decomposition and chunking quality are orthogonal concerns.

Connections to Other Concepts

  • rag.md: Query decomposition and multi-step retrieval are advanced RAG patterns that extend the basic retrieve-and-generate pipeline to handle complex queries.
  • agentic-rag.md: Query decomposition and multi-step retrieval are the core retrieval patterns within agentic RAG. Agentic RAG adds adaptive decision-making on top of these patterns.
  • hyde-hypothetical-document-embeddings.md: HyDE can be applied to individual sub-queries for improved retrieval within a decomposition pipeline.
  • reranking-and-cross-encoders.md: Reranking can be applied after retrieval for each sub-query, improving precision at each step of the multi-step process.
  • self-rag.md: Self-RAG's [Retrieve] token implements a form of adaptive retrieval decision-making, deciding at each generation step whether additional retrieval (a new step) is needed.
  • corrective-rag.md: CRAG's evaluation step can be applied after each sub-query retrieval, triggering fallback retrieval for sub-queries where initial results are irrelevant.
  • raptor.md: RAPTOR's hierarchical index is particularly well-suited for decomposed queries. High-level sub-queries ("What is the overall theme?") retrieve summary nodes; specific sub-queries retrieve leaf nodes.
  • chain-of-thought-in-agents.md: Query decomposition is the retrieval analog of chain-of-thought reasoning. Both break complex problems into sequential steps. IRCoT explicitly interleaves them.
  • prompt-engineering.md: The decomposition prompt is a critical piece of prompt engineering. Few-shot examples, output format instructions, and domain-specific guidance all affect decomposition quality.
  • compound-ai-systems.md: Multi-step retrieval systems with decomposition, routing, reranking, and synthesis are compound AI systems with multiple interacting components.

Further Reading

  • Zheng, H. et al. (2023). "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models." (arXiv: 2310.06117) The step-back prompting paper from Google DeepMind.
  • Trivedi, H. et al. (2023). "Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions." ACL 2023. (arXiv: 2212.10509) The IRCoT paper interleaving retrieval with chain-of-thought reasoning.
  • Zhou, D. et al. (2023). "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." ICLR 2023. The least-to-most decomposition framework from Google Research.
  • Yao, S. et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. The foundational framework for interleaving reasoning with tool use (including retrieval).
  • Shao, Z. et al. (2023). "Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy." (arXiv: 2305.15294) The ITER-RETGEN paper on iterative retrieval-generation.
  • Press, O. et al. (2023). "Measuring and Narrowing the Compositionality Gap in Language Models." EMNLP 2023. Demonstrates that multi-hop questions expose a "compositionality gap" in LLMs that multi-step retrieval can help close.
  • Raudaschl, A. (2023). "RAG-Fusion: a New Take on Retrieval-Augmented Generation." Blog post and implementation introducing multi-query generation with reciprocal rank fusion.