Courses·Advanced And Emerging·10 min read

Self-RAG (Self-Reflective Retrieval-Augmented Generation)

Self-RAG trains a single language model to adaptively decide when to retrieve external knowledge, evaluate whether retrieved passages are relevant, assess whether its own generation is supported by the evidence, and judge the overall utility of its response -- all through special reflection tokens learned during training, eliminating the need for separate retriever and critic components.

One-Line Summary: Self-RAG trains a single language model to adaptively decide when to retrieve external knowledge, evaluate whether retrieved passages are relevant, assess whether its own generation is supported by the evidence, and judge the overall utility of its response -- all through special reflection tokens learned during training, eliminating the need for separate retriever and critic components.

Prerequisites: Understanding of standard RAG pipelines and their fixed retrieve-then-generate structure, the concept of special tokens in language models, instruction tuning and supervised fine-tuning, the problem of hallucination in LLM generation, and awareness that standard RAG always retrieves regardless of whether retrieval is needed.

What Is Self-RAG?

Standard RAG has two fundamental rigidities. First, it always retrieves -- even for questions the model can answer from parametric knowledge ("What is the capital of France?"), retrieval adds latency and may introduce noise. Second, it never evaluates -- the system has no mechanism to assess whether the retrieved passages are relevant to the query or whether the generated answer is actually supported by the retrieved evidence.

flowchart TD
    C1["Retrieve"]
    C2["IsRel"]
    C3["IsSup"]
    C4["IsUse"]
    C1 --> C2
    C2 --> C3
    C3 --> C4

Self-RAG, introduced by Asai, Wu, Wang, Sil, and Hajishirzi (2024, ICLR 2024), addresses both problems by training a single LLM to generate special reflection tokens that control and evaluate the retrieval-generation process. The model learns to:

Decide whether retrieval is needed for a given query (or at any point during generation).
Evaluate whether each retrieved passage is relevant to the query.
Assess whether each generated segment is supported by the retrieved evidence.
Judge the overall utility of the complete response.

These capabilities are encoded as four types of special tokens that the model generates inline with its regular text output. The model is trained end-to-end to produce these tokens through supervised fine-tuning on data annotated by a critic model (GPT-4 in the original paper).

How It Works

The Four Reflection Token Types

Self-RAG introduces four special token types, each generated at specific points during the generation process:

1. [Retrieve] -- Retrieval Decision Token

Generated at the beginning of a response or at segment boundaries during generation. Values:

[Retrieve=Yes]: The model determines it needs external information to answer accurately. Triggers a retrieval step.
[Retrieve=No]: The model is confident it can answer from parametric knowledge. Skips retrieval.
[Retrieve=Continue]: The model is mid-generation and determines it needs additional information to continue.

This is the gate that makes retrieval adaptive rather than mandatory. For factual questions about recent events, the model retrieves. For general knowledge questions or creative tasks, it skips retrieval entirely.

2. [IsRel] -- Relevance Token

Generated after retrieving passages, one per retrieved passage. Values:

[IsRel=Relevant]: The passage contains information relevant to the query.
[IsRel=Irrelevant]: The passage is not relevant.

This allows the model to filter out irrelevant passages before generation, preventing the common RAG failure mode where the model generates answers from irrelevant context.

3. [IsSup] -- Support Token

Generated after each segment of the response, evaluating whether that segment is supported by the retrieved evidence. Values:

[IsSup=Fully Supported]: The generated claim is directly supported by the retrieved passages.
[IsSup=Partially Supported]: Some aspects are supported, others are not.
[IsSup=No Support]: The generated claim is not supported by the retrieved evidence (potential hallucination).

This is Self-RAG's hallucination detection mechanism. By evaluating support at the segment level rather than the response level, it can identify specifically which parts of a response are grounded vs. hallucinated.

4. [IsUse] -- Utility Token

Generated at the end of the complete response. Values on a scale of 1-5:

[IsUse=5]: The response is highly useful, comprehensive, and directly addresses the query.
[IsUse=1]: The response is not useful or does not address the query.

This enables the model to self-evaluate overall response quality, which can be used for response selection when generating multiple candidate responses.

The Self-RAG Generation Process

Input query -> Model generates [Retrieve] token
  |
  |-- [Retrieve=No] -> Generate response directly from parametric knowledge
  |                     -> Generate [IsSup] for each segment
  |                     -> Generate [IsUse] for overall quality
  |
  |-- [Retrieve=Yes] -> Retrieve top-k passages from knowledge base
                        -> For each passage, generate [IsRel]
                        -> Filter to relevant passages
                        -> Generate response segment using relevant passages
                        -> Generate [IsSup] to assess evidence support
                        -> Optionally generate [Retrieve=Continue] for more info
                        -> Generate [IsUse] for overall quality

Inference-Time Flexibility with Critique Tokens

A powerful feature of Self-RAG is that the reflection tokens can be used to control generation behavior at inference time without retraining:

Beam search with reflection scores: Generate multiple candidate responses and use the reflection token scores as beam search criteria. Select the response with the highest combination of relevance, support, and utility scores.

Adjustable retrieval frequency: By modifying the threshold for [Retrieve=Yes], you can make the system retrieve more or less frequently depending on the use case. For high-stakes factual domains, lower the threshold to retrieve more aggressively. For creative tasks, raise it.

Adjustable faithfulness vs. creativity: By weighting [IsSup] scores higher or lower during candidate selection, you can trade off between strict factual grounding and more creative/comprehensive responses.

This inference-time controllability is a major advantage over standard RAG, which has no such knobs.

Training Process

Self-RAG is trained in three stages:

Stage 1: Critic Model Training

A critic model (the paper uses GPT-4) is used to annotate a large corpus of (query, response) pairs with the four reflection token types. For each example, the critic:

Decides whether retrieval would be needed
Evaluates the relevance of retrieved passages
Assesses whether response segments are supported by evidence
Rates overall utility

This produces a dataset of text interleaved with reflection token annotations.

Stage 2: Generator Model Training

The target LLM (the paper uses Llama 2 7B and 13B as base models) is fine-tuned on this annotated dataset using standard next-token prediction. The reflection tokens are added to the model's vocabulary, and the model learns to predict them as part of the regular token sequence. This means the model learns not just to generate text, but to generate reflection tokens at the appropriate positions.

Stage 3: Retriever Integration

A standard dense retriever (the paper uses Contriever-MS MARCO) is used to retrieve passages when the model generates [Retrieve=Yes]. The retriever is not trained jointly with the generator -- it is an off-the-shelf component.

Training Data Scale

The original Self-RAG paper:

Used GPT-4 to annotate approximately 150K instruction-following examples from diverse datasets
Trained the critic model on this data
Fine-tuned Llama 2 (7B and 13B) as the generator
Used Contriever-MS MARCO as the retriever with a Wikipedia-based knowledge base

Benchmark Results

Self-RAG demonstrated strong results across multiple benchmarks, outperforming both standard RAG and retrieval-free baselines:

Open-domain QA (PopQA, TriviaQA-unfiltered):

Self-RAG (13B) achieved 54.9% accuracy on PopQA, compared to 44.0% for Llama2-13B with standard RAG and 29.3% for Llama2-13B without retrieval.
On TriviaQA, Self-RAG (13B) achieved 69.3%, compared to 60.4% for standard RAG.

Fact Verification (PubHealth):

Self-RAG (7B) achieved 72.4% accuracy, compared to 67.5% for Llama2-7B + RAG.

Long-form Generation (ASQA -- Ambiguous Short Question Answering):

Self-RAG significantly improved both factual accuracy (measured by correctness) and citation quality (measured by citation precision and recall).

Reasoning (ARC-Challenge):

Self-RAG (13B) achieved 67.3%, compared to 63.2% for ChatGPT.

Key finding: Self-RAG's adaptive retrieval means it retrieves only when beneficial. On questions where the model has sufficient parametric knowledge, skipping retrieval actually improves performance by avoiding noise from irrelevant passages. The paper showed that Self-RAG retrieved for approximately 50-80% of queries depending on the task, correctly skipping retrieval for straightforward questions.

Why It Matters

Self-RAG represents a paradigm shift in how we think about RAG systems:

From pipeline to unified model: Standard RAG is a pipeline of separate components (retriever, generator, optional reranker) with no feedback between them. Self-RAG collapses the retrieval decision, relevance evaluation, and faithfulness assessment into the generator itself. This eliminates the "impedance mismatch" between components.

Adaptive retrieval: Real-world queries vary enormously in their information needs. Some require extensive retrieval; others are trivially answerable from parametric knowledge. Self-RAG adapts on a per-query basis, avoiding unnecessary retrieval for simple questions and retrieving aggressively for knowledge-intensive ones.

Built-in hallucination detection: The [IsSup] token provides a native mechanism for detecting unsupported claims. This is more principled than post-hoc hallucination detection because it is learned jointly with the generation process.

Inference-time control: The reflection tokens provide tunable knobs for controlling the retrieval-generation trade-off at inference time, without retraining. This is valuable for deploying the same model across different use cases with different faithfulness requirements.

Efficiency: By skipping retrieval for queries that do not need it, Self-RAG reduces both latency and cost compared to standard RAG, which always incurs retrieval overhead.

Key Technical Details

Reflection token vocabulary: The four reflection token types and their values add approximately 10-15 new tokens to the model's vocabulary. These are learned embeddings, just like regular token embeddings.
Segment-level evaluation: Self-RAG generates [IsSup] tokens at segment boundaries (typically every 1-3 sentences), not at the end of the full response. This provides fine-grained faithfulness evaluation.
Parallel candidate generation: At inference time, Self-RAG can generate multiple candidates (one for each retrieved passage, plus one without retrieval) and use reflection scores to select the best one. This is similar to best-of-N sampling but guided by learned quality signals.
Retriever agnostic: While the paper uses Contriever, Self-RAG works with any retriever. The reflection tokens evaluate the retrieved content regardless of how it was obtained.
Computational cost: Training Self-RAG requires generating critic annotations (expensive, one-time GPT-4 cost) and fine-tuning the base model (standard SFT cost). At inference time, the reflection tokens add minimal overhead (~5-10% more tokens generated).
Model sizes tested: The paper evaluated 7B and 13B parameter models. The 13B model consistently outperformed the 7B model, suggesting that the reflection capability benefits from larger model capacity.

Common Misconceptions

"Self-RAG requires GPT-4 at inference time." GPT-4 is only used during training to generate critic annotations. At inference time, the fine-tuned model (Llama 2 7B/13B in the paper) generates reflection tokens itself. The entire system runs on a single, relatively small model.

"Self-RAG is the same as Corrective RAG (CRAG)." CRAG adds a separate evaluation step after retrieval using an external evaluator (a classifier or LLM-as-judge). Self-RAG trains the generator itself to perform this evaluation through learned tokens. Self-RAG is a single-model approach; CRAG is a multi-component pipeline approach. They address similar problems (evaluating retrieval quality) but with fundamentally different architectures.

"The reflection tokens are just prompting tricks." The reflection tokens are learned parameters in the model's vocabulary. The model is trained to predict them through supervised fine-tuning on annotated data. They are not prompt-injected instructions -- they are genuine model outputs that reflect learned self-evaluation capabilities.

"Self-RAG eliminates the need for a retriever." Self-RAG still uses an external retriever (Contriever, or any dense retriever) to fetch passages when the model decides retrieval is needed. The innovation is in making retrieval adaptive and adding self-evaluation, not in replacing the retriever.

"Self-RAG works only with Llama models." The Self-RAG training process can be applied to any instruction-tunable LLM. The paper used Llama 2 as the base, but the technique is architecture-agnostic.

Connections to Other Concepts

rag.md: Self-RAG is an advanced RAG paradigm that makes retrieval adaptive and self-evaluating, addressing the fixed-pipeline limitations of standard RAG.
corrective-rag.md: Both address retrieval evaluation, but CRAG uses external evaluators while Self-RAG uses internal reflection tokens. They share the insight that retrieval quality must be assessed before generation.
agentic-rag.md: Self-RAG can be seen as a lightweight form of agentic RAG where the agent's decision-making (when to retrieve, what is relevant) is internalized in the model rather than implemented as explicit agent logic.
hallucination.md: Self-RAG directly addresses hallucination through the [IsSup] token, providing a native mechanism for detecting unsupported claims during generation.
constitutional-ai.md: Both Self-RAG and Constitutional AI train models to self-evaluate and self-correct. Self-RAG applies this principle to retrieval and factual grounding; Constitutional AI applies it to safety and harmlessness.
reward-modeling.md: The reflection tokens can be viewed as a form of process reward -- evaluating intermediate steps (retrieval decisions, relevance, support) rather than just the final output.