HyPA-RAG
Two ideas that fix specific RAG failures: hypothetical document embeddings (HyDE) for queries that don’t look like answers, and parent-aware hierarchical retrieval for documents where chunks lose context.
The five-bullet version
- HyPA-RAG bundles two specific RAG patches: HyDE for query embedding, hierarchical / parent-aware chunking for retrieval.
- HyDE: ask the LLM to write a fake answer to the query, embed that, use it to retrieve. The fake answer looks more like the real chunks than the question did.
- Parent-aware retrieval: store small chunks for precision, but expand to the parent section when handing context to the LLM, so it has surrounding context.
- Both fix concrete failure modes (short queries that embed badly; tiny chunks that lose context).
- Worth adding when evals show those specific failures; otherwise extra complexity.
§ 00 · HYPOTHETICAL-DOCUMENT RETRIEVAL
Why questions and answers don’t embed near each other
Dense retrieval works because semantically similar texts get nearby embeddings. The assumption: the query and the relevant chunk are semantically similar. This is often a stretch.
Compare:
- Query. “How do I revert a release?” — short, imperative, second-person.
- Doc chunk. “Release rollback procedure. Engineers can revert a deployment by running...” — long, declarative, third-person, full of nouns.
Both are about rolling back releases, but they don’t look alike. Their embeddings may land near each other, but not as near as you want. A better-phrased question or a less formally written doc would produce closer embeddings.
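To make the mismatch concrete, here is a minimal sketch you could run against any sentence-embedding model. The `embed` function is a placeholder for whatever model you plug in, not a specific library API:

```python
# Compare how close the query and the doc chunk embed, assuming a
# generic embed(text) -> np.ndarray function (placeholder, not a
# specific library's API).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "How do I revert a release?"
chunk = ("Release rollback procedure. Engineers can revert a deployment "
         "by running the rollback script against the target environment.")

# With a real embedding model plugged in, the score is typically positive
# but lower than you'd expect for two texts 'about the same thing':
# score = cosine(embed(query), embed(chunk))
```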
§ 01 · HYDE — EMBED THE ANSWER, NOT THE QUESTION
Bridge the shape mismatch
HyDE (Hypothetical Document Embeddings) flips the question into something more answer-shaped. The procedure:
- Receive the user query.
- Ask an LLM: “Write a plausible answer to this question.” The answer doesn’t have to be correct; it just has to look like an answer.
- Embed that hypothetical answer.
- Use the embedding for retrieval (instead of, or in addition to, the query embedding).
The hypothetical answer is in the same prose register as real chunks — declarative, full of nouns, third-person. Its embedding sits in the same neighborhood as real chunks about the same topic. Even though the hypothetical answer is invented, it acts as a bridge.
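A minimal sketch of that procedure, assuming generic callables for the LLM call, the embedding model, and the vector-store lookup; the names `complete`, `embed`, and `search` are placeholders, not any specific library's API:

```python
# HyDE retrieval step: generate a fake answer, embed it, search with it.
from typing import Callable, List

HYDE_PROMPT = (
    "Write a short, plausible documentation passage that answers the "
    "question below. It does not need to be factually correct; it should "
    "just read like a real doc chunk.\n\nQuestion: {query}\n\nPassage:"
)

def hyde_retrieve(
    query: str,
    complete: Callable[[str], str],              # LLM text completion
    embed: Callable[[str], List[float]],         # embedding model
    search: Callable[[List[float], int], list],  # vector-store lookup
    k: int = 5,
) -> list:
    # 1. Generate a hypothetical answer in the same register as the docs.
    hypothetical = complete(HYDE_PROMPT.format(query=query))
    # 2. Embed the fake answer, not the question.
    vector = embed(hypothetical)
    # 3. Retrieve real chunks near the answer-shaped embedding.
    return search(vector, k)
```

In practice, many teams also retrieve with the raw query embedding and merge the two result lists, so a bad hypothetical answer can’t sink the whole retrieval.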
§ 02 · HIERARCHICAL RAG — CHUNKS WITHIN CHUNKS
Multiple granularities
Chunk size is a Goldilocks problem. Small chunks (200 tokens) embed precisely — the embedding represents one focused idea. But when retrieved, they lack the surrounding context that makes the idea legible. Large chunks (2000 tokens) have context, but their embeddings are averaged across the whole chunk, so search precision suffers.
Hierarchical retrieval gets both. Index the document at two granularities:
- Small chunks for search precision. Each one is one focused passage.
- Parent sections for context. Each small chunk knows which parent it belongs to.
At query time, search the small chunks. When handing results to the LLM, pass the parent section for each retrieved chunk instead of (or in addition to) the chunk itself. Precision from small chunks, context from parents.
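A sketch of the expansion step, assuming each indexed chunk record carries a `parent_id` and a separate `parents` store maps ids to full sections; the names here are illustrative, not any framework’s schema:

```python
# Swap retrieved small chunks for their parent sections before prompting.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent_id: str

def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Replace each retrieved small chunk with its parent section, deduplicated."""
    seen: set[str] = set()
    context: list[str] = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            context.append(parents[hit.parent_id])
    return context
```

The deduplication matters: two hits from the same section collapse into one parent, so the prompt stays compact even when several small chunks match.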
§ 03 · PARENT-DOCUMENT PATTERNS
Variants on the theme
Three concrete patterns built from these ideas:
- Small-to-big. Index small chunks; on retrieval, expand to the parent paragraph or section. The default.
- Hierarchical summary. For each section, store a one-paragraph summary. Index summaries; on retrieval, fetch the full section. Useful when sections are very long.
- Multi-vector per chunk. Store multiple embeddings per chunk — one for the chunk itself, one for an LLM-generated summary, one for a list of generated questions the chunk answers. Retrieval considers all three.
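A sketch of the multi-vector pattern, assuming the same placeholder `complete` and `embed` callables as above plus a `store` with an `add(vector, payload)` method; none of these are a specific library’s API:

```python
# Index one chunk under three embeddings that all resolve to the same text,
# so a hit on any representation returns the original chunk.
def index_multi_vector(chunk_id: str, text: str, complete, embed, store):
    summary = complete(f"Summarize this passage in one sentence:\n{text}")
    questions = complete(f"List three questions this passage answers:\n{text}")
    for representation in (text, summary, questions):
        store.add(vector=embed(representation), payload={"chunk_id": chunk_id})
```

The generated-questions vector is the interesting one: it puts an answer-shaped chunk near question-shaped queries, attacking the same mismatch HyDE does, but at index time instead of query time.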
§ 04 · WHEN THE EXTRA PLUMBING PAYS OFF
Knowing when to add the complexity
HyDE earns its cost when:
- Queries are short, conversational, or in a very different register from the documents.
- The corpus is well-written prose (which the LLM can plausibly imitate). HyDE is less useful for code or highly structured data.
Hierarchical retrieval earns its cost when:
- Documents are long and structured (PDFs, manuals, contracts).
- Fine details matter, but the answer needs surrounding context to interpret.
- No single fixed chunk size — small or large — closes the obvious holes in eval scores.
§ 05 · TAKING THIS FORWARD
Composition over individual tricks
HyDE and hierarchical retrieval are two of many specific RAG patches. Reranking, hybrid retrieval, query expansion, multi-step retrieval — each fixes a particular failure mode. The right question for practitioners isn’t “which one is best?” but “which failure mode does my eval reveal?” The fix usually follows.
§ · GOING DEEPER
Hypothetical answers and hierarchical chunks
HyDE (Hypothetical Document Embeddings; Gao et al. 2023) is the first half. Use an LLM to generate a plausible answer to the user’s question, then embed that fake answer and use it to retrieve. The intuition: answers look like answers (declarative, dense with topical nouns), so the embedding of a synthetic answer is closer to real chunks than the embedding of the original question. Empirically, this lifts retrieval quality on hard queries with little engineering cost.
Hierarchical / parent-aware indexing is the second half. Index small chunks for retrieval precision; store their parents (sections, documents) for the LLM’s context. RAPTOR (Sarthi et al. 2024) takes this further: recursively summarize clusters of chunks into a tree, then retrieve from any level of the tree. The model gets fine-grained matches with the context needed to interpret them.
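A heavily simplified sketch of RAPTOR-style tree building: the real method clusters chunks by embedding similarity before summarizing; a fixed-size grouping stands in for clustering here, and `complete` is the same placeholder LLM call used above:

```python
# Build summary levels bottom-up; levels[0] is the original chunks and
# each higher level summarizes groups from the level below.
def build_tree(chunks: list[str], complete, group_size: int = 4) -> list[list[str]]:
    assert group_size >= 2, "grouping must shrink each level"
    levels = [chunks]
    while len(levels[-1]) > 1:
        current = levels[-1]
        summaries = [
            complete("Summarize these passages:\n" + "\n".join(current[i:i + group_size]))
            for i in range(0, len(current), group_size)
        ]
        levels.append(summaries)
    return levels
```

Retrieval then embeds and searches every level at once: leaves for fine detail, upper levels for the gist of a whole region of the document.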
§ · FURTHER READING
References & deeper sources
- Gao et al. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE) · ACL
- Sarthi et al. (2024). RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval · ICLR
- Edge et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization · arXiv
- Karpukhin et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering · EMNLP
- Gao et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey · arXiv
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.