Courses·Alignment And Post Training·7 min read

Constitutional AI (CAI)

Constitutional AI aligns language models by replacing human preference labels with AI-generated feedback guided by an explicit set of principles (a "constitution"), making the alignment process more scalable, transparent, and auditable.

One-Line Summary: Constitutional AI aligns language models by replacing human preference labels with AI-generated feedback guided by an explicit set of principles (a "constitution"), making the alignment process more scalable, transparent, and auditable.

Prerequisites: Understanding of RLHF (the full pipeline including reward models and policy optimization), supervised fine-tuning, and the basic challenges of AI alignment including safety and helpfulness trade-offs.

What Is Constitutional AI?

Traditional RLHF relies on thousands of human annotators to compare model outputs and express preferences. This is expensive, slow, hard to scale, and -- critically -- opaque. When a human rater says "Response A is better than Response B," you don't always know why. Different raters may have different values, creating inconsistent training signals.

flowchart LR
    S1["Constitutional AI pipeline"]
    S2["the critique-revision loop guided by expli"]
    S1 --> S2

Constitutional AI, developed by Anthropic, takes a radically different approach. Instead of relying on human raters for each individual comparison, you write down your values explicitly as a set of principles -- a "constitution" -- and then use AI to apply those principles consistently across all training data.

Think of the analogy of a legal system. RLHF is like having a jury decide every individual case without written law -- you get human judgment, but it's inconsistent and unscalable. Constitutional AI is like creating a written legal code (the constitution) and having a trained judge (the AI) apply it consistently to every case. The principles are explicit, auditable, and can be debated and revised.

How It Works

Constitutional AI operates in two main phases:

flowchart LR
    subgraph L1["Comparison of RLHF with human feedback"]
        LI3["Comparison of RLHF"]
        LI4["human feedback"]
    end
    subgraph R2["CAI with AI feedback"]
        RI5["AI feedback"]
    end

Phase 1: Supervised Learning from AI Feedback (SL-CAI)

Generate initial responses: Prompt the model (typically a helpful-only model that hasn't been safety-trained) with red-team prompts designed to elicit harmful or problematic outputs. The model generates responses that may be harmful.
Self-critique: Ask the same model to critique its own response according to a specific constitutional principle. For example:

Principle: "Choose the response that is least likely to be used for harmful purposes." Critique prompt: "Identify specific ways in which the assistant's last response is harmful, unethical, or problematic. Please be as specific as possible." The model generates a critique explaining what's wrong with its response.
Revision: Ask the model to revise its response in light of the critique:

"Please rewrite the assistant's response to remove any harmful content, while remaining helpful." The model produces an improved response.
Repeat: Multiple rounds of critique and revision can be applied, each using different constitutional principles. This progressively improves the response.
Fine-tune: Use the (prompt, revised response) pairs as supervised fine-tuning data. This creates a model that has internalized the constitutional principles through the revised examples.

Phase 2: RL from AI Feedback (RLAIF)

Generate comparison pairs: For a set of prompts, generate multiple responses from the SL-CAI model.
AI preference labeling: Instead of having humans compare responses, ask a language model to choose which response better adheres to the constitutional principles. The model is given the relevant principles and asked to make a judgment:

"Consider the following principles: [constitution excerpt]. Which response better follows these principles?"
Train a reward model: Use the AI-generated preference labels to train a reward model, exactly as in standard RLHF.
Optimize with RL: Use the reward model to optimize the policy via PPO or another RL algorithm, again following the standard RLHF approach but with AI-generated rather than human-generated preference data.

The Constitution

The constitution itself is a set of natural-language principles. Examples from Anthropic's published constitution include:

"Choose the response that is most supportive and encouraging of life, liberty, and personal security."
"Choose the response that is least likely to be perceived as harmful or offensive by a non-expert."
"Choose the response that most accurately and honestly answers the question, given the context."
Principles drawn from the UN Universal Declaration of Human Rights, Apple's terms of service, and other sources.

The beauty of this approach is that the principles can be inspected, debated, and refined. Different organizations can create different constitutions reflecting different values.

Constitutional Classifiers

A more recent extension of the CAI framework is Constitutional Classifiers -- specialized models trained to detect whether inputs or outputs violate constitutional principles. Rather than baking all safety behavior into the generative model, these classifiers act as external guardrails:

An input classifier screens user prompts for potential policy violations before the model generates a response.
An output classifier screens the model's responses before they are delivered to the user.

These classifiers can be updated independently of the main model, allowing for rapid iteration on safety boundaries. They are trained using synthetically generated data -- the model generates examples of content that would violate each principle, and these examples (along with benign examples) become the training data for the classifier.

Why It Matters

Constitutional AI addresses several fundamental limitations of standard RLHF:

Scalability: Human annotation is expensive and slow. AI feedback can be generated at massive scale for a fraction of the cost, enabling preference training on far more data.

Consistency: Human raters disagree with each other frequently (inter-annotator agreement on preference tasks is often only 60-75%). A constitution applied by a model produces more consistent judgments.

Transparency: The constitution makes the alignment criteria explicit. Stakeholders can inspect, critique, and debate the principles rather than trying to reverse-engineer them from implicit human preferences.

Reduced harm to annotators: RLHF for safety requires human annotators to read and evaluate harmful content, which can be psychologically damaging. CAI reduces (though doesn't eliminate) this exposure.

Auditability: When a model behaves in a certain way, you can trace it back to specific constitutional principles, making the alignment process more interpretable.

Key Technical Details

Constitution design is a new and important skill. Too few principles lead to gaps; too many can create conflicts. The principles must be specific enough to guide behavior but general enough to cover novel situations.
The quality of AI feedback improves with model capability. More capable models apply constitutional principles more accurately, creating a virtuous cycle where better models enable better alignment.
Multiple critique-revision rounds yield diminishing returns. Typically 2-4 rounds are used; beyond that, improvements are marginal.
RLAIF does not fully replace human oversight. Anthropic and others still use human evaluation to validate that AI feedback is well-calibrated. The constitution itself is written by humans.
Constitutional Classifiers can be jailbreak-resistant. By training on a large corpus of synthetically generated adversarial attacks, these classifiers can generalize to novel jailbreak attempts better than rule-based filters.
The constitution can be customized per deployment context. An enterprise deployment might add industry-specific principles; a children's education application might emphasize age-appropriateness.

Common Misconceptions

"Constitutional AI means no humans are involved." Humans write the constitution, humans evaluate the system's overall behavior, and humans make decisions about what principles to include. CAI shifts human effort from individual labeling to system-level design.
"The constitution is fixed and universal." Different applications may warrant different constitutions. The principles are meant to be iterated upon and adapted.
"CAI only addresses safety." While safety is a major focus, constitutional principles can also address helpfulness, accuracy, tone, and other quality dimensions.
"AI feedback is always worse than human feedback." Studies have shown that RLAIF can match or even exceed RLHF performance in some settings, particularly when the AI feedback provider is significantly more capable than typical human annotators.
"Constitutional AI eliminates alignment failures." CAI reduces certain failure modes but introduces others, such as the AI misinterpreting a principle or the constitution having gaps. No alignment technique is complete.

Connections to Other Concepts

rlhf.md: is the foundation that CAI modifies. CAI uses the same reward model and RL optimization steps but changes the source of preference labels from humans to AI.
dpo.md: can be used in place of PPO in the RL phase of CAI, combining the benefits of both approaches -- AI-generated preferences with simplified optimization.
reward-modeling.md: still plays a role in Phase 2 of CAI, with AI preferences used to train the reward model.
synthetic-data.md: generation is at the heart of CAI -- the critiques, revisions, and preference labels are all forms of synthetic training data.
red-teaming.md: often provides the adversarial prompts used in Phase 1 to elicit responses that need constitutional revision.