Courses·Alignment And Post Training·7 min read

Preference Learning Variants

Alternatives to DPO that reduce data requirements, simplify training pipelines, or improve robustness -- each trading off different aspects of preference optimization.

One-Line Summary: Alternatives to DPO that reduce data requirements, simplify training pipelines, or improve robustness -- each trading off different aspects of preference optimization.

Prerequisites: RLHF, Direct Preference Optimization (DPO), reward modeling, supervised fine-tuning (SFT), KL divergence, cross-entropy loss

What Is Preference Learning Variants?

Imagine you are training a new restaurant chef. DPO is like showing the chef two dishes side by side for every meal -- "this steak is better than that steak" -- and requiring a perfectly matched pair every time. This works, but sourcing those exact pairs is expensive and labor-intensive. Now imagine alternatives: one approach (KTO) just uses individual diner reviews ("thumbs up" or "thumbs down" on single dishes), another (SimPO) lets the chef self-evaluate without constantly referencing a baseline recipe, another (ORPO) combines cooking school and taste refinement into a single curriculum, and yet another (IPO) adds guardrails so the chef does not over-index on a few strongly-opinionated reviews.

flowchart LR
    subgraph L1["Comparison of DPO, IPO, KTO,"]
        LI3["their data requirements"]
    end
    subgraph R2["ORPO loss functions"]
        RI4["Feature 1"]
    end

These are the preference learning variants -- a family of algorithms that all aim to align language models with human preferences, but each relaxes or restructures a different assumption that DPO makes. DPO itself was a breakthrough because it eliminated the need for a separate reward model, distilling the RLHF objective into a simple classification loss over preference pairs. But DPO still has friction points: it requires carefully curated paired data, needs a frozen reference model in memory, and can overfit to preference margins. The variants described here each tackle one or more of those pain points.

The rapid proliferation of these methods -- ORPO, SimPO, KTO, IPO, and others like SPPO and RPO -- reflects the field's recognition that preference alignment is not one-size-fits-all. Different deployment scenarios have different data availability, compute budgets, and robustness requirements.

How It Works

flowchart LR
    S1["KTO architecture"]
    S2["optimization from binary feedback (thumbs "]
    S1 --> S2

KTO: Kahneman-Tversky Optimization

KTO (Ethayarajh et al., 2024) is perhaps the most radical departure from DPO. Instead of requiring paired comparisons (chosen and rejected responses for the same prompt), KTO works with unpaired binary feedback -- simple thumbs-up or thumbs-down on individual outputs. The key insight draws from prospect theory in behavioral economics: humans weigh losses more heavily than equivalent gains.

The KTO loss function treats desirable and undesirable outputs asymmetrically:

# Simplified KTO loss structure
def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1):
    # KL divergence term for baseline
    kl = (policy_logps - ref_logps).mean().clamp(min=0)
 
    # Log-ratio between policy and reference
    log_ratio = policy_logps - ref_logps
 
    # Asymmetric weighting inspired by prospect theory
    desirable_loss = -F.logsigmoid(beta * (log_ratio - kl))
    undesirable_loss = -F.logsigmoid(beta * (kl - log_ratio))
 
    loss = torch.where(is_desirable, desirable_loss, undesirable_loss)
    return loss.mean()

This is transformative for data collection. Most real-world feedback systems (like/dislike buttons, flagging mechanisms) produce unpaired binary signals, not carefully matched preference pairs. KTO can consume this data directly.

SimPO: Simple Preference Optimization

SimPO (Meng et al., 2024) eliminates the reference model entirely. DPO requires keeping a frozen copy of the original model in GPU memory during training to compute log-probability ratios. SimPO replaces this with the average log probability of the sequence as an implicit reward signal, plus a target margin:

# SimPO reward: length-normalized log probability
def simpo_reward(logprobs, sequence_length):
    return logprobs.sum() / sequence_length  # average log-prob
 
# SimPO loss with target margin gamma
def simpo_loss(chosen_logprobs, rejected_logprobs, chosen_len, rejected_len, beta=2.0, gamma=1.4):
    chosen_reward = simpo_reward(chosen_logprobs, chosen_len)
    rejected_reward = simpo_reward(rejected_logprobs, rejected_len)
    loss = -F.logsigmoid(beta * (chosen_reward - rejected_reward) - gamma)
    return loss.mean()

The length normalization prevents the model from gaming the reward by simply producing longer sequences. The target margin gamma ensures the model learns a meaningful separation between chosen and rejected responses rather than barely distinguishing them.

ORPO: Odds Ratio Preference Optimization

ORPO (Hong et al., 2024) merges SFT and preference alignment into a single training stage. Standard pipelines run SFT first, then apply DPO or RLHF. ORPO combines both objectives: a standard negative log-likelihood loss for the chosen response plus an odds-ratio-based penalty for the rejected response:

# ORPO: combined SFT + alignment in one loss
def orpo_loss(chosen_logprobs, rejected_logprobs, lambda_weight=0.1):
    # Standard SFT loss on chosen response
    sft_loss = -chosen_logprobs.mean()
 
    # Odds ratio: contrasts likelihood of chosen vs rejected
    log_odds_chosen = chosen_logprobs - torch.log1p(-chosen_logprobs.exp())
    log_odds_rejected = rejected_logprobs - torch.log1p(-rejected_logprobs.exp())
    odds_ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
 
    return sft_loss + lambda_weight * odds_ratio_loss.mean()

By collapsing two training stages into one, ORPO reduces total compute, eliminates the SFT-to-alignment distribution shift, and simplifies the training pipeline.

IPO: Identity Preference Optimization

IPO (Azar et al., 2024) addresses a subtle failure mode in DPO: when preference margins are large, DPO can overfit, driving the policy too far from the reference model. IPO adds a regularization term that directly constrains how much the preference gap can grow:

L_IPO = (log(pi(y_w|x)/ref(y_w|x)) - log(pi(y_l|x)/ref(y_l|x)) - 1/(2*beta))^2

Instead of the sigmoid-based loss in DPO, IPO uses a squared loss centered on a target margin of 1/(2*beta). This prevents the model from endlessly pushing chosen and rejected outputs apart and keeps training stable.

Why It Matters

Data efficiency: KTO's ability to use unpaired binary feedback slashes data collection costs by 50% or more, since each piece of feedback stands alone rather than requiring a matched pair. Real-world feedback is overwhelmingly unpaired.
Memory efficiency: SimPO removes the reference model from GPU memory during training. For a 70B parameter model in bf16, this saves approximately 140 GB of VRAM -- potentially the difference between fitting on one node versus needing two.
Pipeline simplification: ORPO collapses SFT and alignment into one stage, eliminating an entire training phase, the associated hyperparameter tuning, and the distribution shift between stages.
Training stability: IPO's regularization directly prevents the overfitting to preference margins that plagues DPO on noisy or imbalanced preference datasets.
Flexibility for deployment contexts: Different production scenarios favor different methods -- KTO for consumer apps with like/dislike buttons, SimPO for memory-constrained fine-tuning, ORPO for rapid iteration cycles.

Key Technical Details

KTO requires roughly a 1:1 ratio of desirable to undesirable examples for stable training; extreme imbalance degrades performance.
SimPO's gamma margin hyperparameter typically ranges from 1.0 to 1.8; values too high cause underfitting, too low reduce preference separation.
ORPO's lambda weighting between SFT and odds-ratio loss needs careful tuning -- too high collapses generation diversity, too low provides weak alignment signal.
IPO's squared loss can slow convergence relative to DPO's sigmoid loss, often requiring 1.5-2x more training steps.
All variants still assume preference data quality is high; garbage-in-garbage-out applies regardless of algorithmic sophistication.
SimPO has shown strong results on AlpacaEval and Arena-Hard benchmarks, sometimes outperforming DPO despite its simpler formulation.
KTO and DPO converge to similar performance when given the same underlying preference distribution, but diverge when data is noisy or sparse.

Common Misconceptions

"KTO always underperforms DPO because it uses less information per example." In practice, KTO often matches DPO quality because real-world preference pairs are noisy, and KTO's loss function is more robust to label noise. The information loss from unpairing is partially offset by being able to use far more data.
"SimPO's lack of a reference model means it has no regularization." SimPO uses length-normalized average log probability and a target margin as implicit regularization. The length normalization prevents reward hacking via verbosity, and the margin prevents collapse.
"ORPO eliminates the need for preference data." ORPO still requires paired chosen/rejected responses. What it eliminates is the separate SFT stage, not the preference data itself.
"These variants make DPO obsolete." DPO remains the best-understood method with the strongest theoretical guarantees. When you have high-quality paired data and sufficient compute, DPO is still a strong default. The variants shine in specific constraint scenarios.

Connections to Other Concepts

04-direct-preference-optimization.md: The foundational method all variants extend or modify; understanding DPO's loss function is essential.
rlhf.md: The original paradigm that DPO and its variants aim to simplify; all share the same underlying Bradley-Terry preference model assumption (except KTO, which uses prospect theory).
reward-modeling.md: SimPO and ORPO bypass explicit reward models; KTO reframes the reward signal from comparative to absolute.
goodharts-law.md: All preference methods risk optimizing a proxy for human intent; IPO's regularization directly mitigates this.
constitutional-ai.md: An orthogonal alignment approach that generates preference data from principles, which can then be used with any of these training methods.