One-Line Summary: Inverse reinforcement learning recovers the reward function that an expert is implicitly optimizing, answering "what are they trying to do?" rather than "how are they doing it?"
Prerequisites: Markov decision processes, policy optimization, maximum entropy models, Bellman equations
What Is Inverse Reinforcement Learning?
Imagine watching a master chef prepare a complex dish. You could try to copy their every move -- chop here, stir there -- but you would fail the moment the recipe changes. Instead, if you understood their underlying goals (balance flavors, achieve certain textures, present beautifully), you could cook any dish well. Inverse reinforcement learning (IRL) is this deeper form of understanding: rather than imitating actions, it infers the reward function that explains observed behavior.
Standard RL takes a reward function and finds an optimal policy. IRL inverts this: given an optimal (or near-optimal) policy demonstrated through trajectories, it recovers the reward function that makes the observed behavior optimal. This is fundamentally harder because the mapping from rewards to behavior is many-to-one -- many different reward functions can produce the same optimal policy.
How It Works
The IRL Problem
Given a set of expert demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, where each trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is generated by an expert policy $\pi_E$, IRL seeks a reward function $R^*$ such that:

$$\pi_E \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t, a_t)\right]$$
The classic approach (Ng and Russell, 2000) formulates this as finding $R$ such that the expert's policy achieves higher value than any other policy. However, the trivial solution $R = 0$ always satisfies this, revealing the fundamental reward ambiguity problem.
Feature Matching and Reward Ambiguity
Early IRL methods represent the reward as a linear combination of features: $R(s) = w^\top \phi(s)$. The key insight is that a policy's expected return depends only on its feature expectations:

$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right] = w^\top \mu(\pi), \qquad \mu(\pi) := \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right]$$
IRL then seeks $w$ such that $w^\top \mu(\pi_E) \geq w^\top \mu(\pi)$ for all policies $\pi$. This reduces to finding a reward vector $w$ under which the expert's feature expectations dominate. But the set of valid $w$ is a convex cone, so additional constraints are needed to select a unique solution.
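To make the feature-matching view concrete, here is a minimal sketch in a discounted setting; the function names and the `phi` feature map are illustrative, not from any particular library:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Empirical discounted feature expectations mu(pi) from sampled trajectories.

    trajectories: list of state sequences; phi: maps a state to a feature vector.
    """
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

def expert_dominates(w, expert_trajs, other_trajs, phi, gamma=0.99):
    """With a linear reward R(s) = w . phi(s), the expert's value exceeds the other
    policy's value exactly when this inequality on feature expectations holds."""
    return w @ feature_expectations(expert_trajs, phi, gamma) >= \
           w @ feature_expectations(other_trajs, phi, gamma)
```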
Maximum Entropy IRL
Maximum entropy IRL (Ziebart et al., 2008) resolves the ambiguity elegantly by assuming the expert follows a Boltzmann-rational policy where the probability of a trajectory is exponential in its return:

$$P(\tau \mid \theta) = \frac{1}{Z(\theta)} \exp\big(R_\theta(\tau)\big), \qquad R_\theta(\tau) = \sum_{t} R_\theta(s_t)$$

where $Z(\theta) = \sum_{\tau} \exp(R_\theta(\tau))$ is the partition function. The reward parameters $\theta$ are found by maximum likelihood:

$$\theta^* = \arg\max_{\theta} \; \mathcal{L}(\theta) = \arg\max_{\theta} \sum_{\tau \in \mathcal{D}} \log P(\tau \mid \theta)$$

The gradient takes an intuitive form -- matching feature expectations between the expert and the model:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}\big[\phi(\tau)\big] - \mathbb{E}_{\tau \sim P(\tau \mid \theta)}\big[\phi(\tau)\big], \qquad \phi(\tau) = \sum_t \phi(s_t)$$
This means: adjust the reward so that the expected features under the induced policy match the expert's features.
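The following is a minimal sketch of one such gradient step in a small tabular, finite-horizon MDP, assuming a state-only linear reward $R_\theta(s) = \theta^\top \phi(s)$; the function name and array layout are illustrative assumptions. The backward pass solves the soft forward problem, the forward pass computes expected state visitations, and the gradient is the difference in feature counts:

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl_step(P, feats, mu_E, theta, p0, horizon, lr=0.1):
    """One gradient-ascent step of MaxEnt IRL in a tabular, finite-horizon MDP.

    P: transitions, shape (S, A, S).  feats: state features, shape (S, d).
    mu_E: empirical expert feature counts (summed over time, averaged over
    trajectories), shape (d,).  p0: initial-state distribution, shape (S,).
    """
    S, A, _ = P.shape
    r = feats @ theta                        # reward R_theta(s) = theta . phi(s)

    # Backward pass: soft value iteration gives the Boltzmann-rational policy.
    V = np.zeros(S)
    for _ in range(horizon):
        Q = r[:, None] + P @ V               # (S, A): r(s) + E_{s'}[V(s')]
        V = logsumexp(Q, axis=1)             # soft maximum over actions
    pi = np.exp(Q - V[:, None])              # pi(a|s) proportional to exp(Q(s,a))

    # Forward pass: expected state-visitation counts under the current policy.
    d = p0.copy()
    D = np.zeros(S)
    for _ in range(horizon):
        D += d
        d = np.einsum('s,sa,sax->x', d, pi, P)

    # Gradient = expert feature counts - model feature counts; ascend.
    grad = mu_E - feats.T @ D
    return theta + lr * grad
```

Each call re-solves the forward RL problem, which is exactly the inner-loop cost noted under Key Technical Details below.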
Bayesian IRL
Bayesian IRL (Ramachandran and Amir, 2007) maintains a posterior distribution over reward functions:

$$P(R \mid \mathcal{D}) \propto P(\mathcal{D} \mid R)\, P(R)$$

where the likelihood assumes the expert is Boltzmann-rational with temperature parameter $\beta$: $P(a \mid s, R) \propto \exp\big(\beta\, Q^*_R(s, a)\big)$. The posterior captures the full set of reward functions consistent with the demonstrations, naturally handling ambiguity. However, sampling from this posterior is computationally expensive.
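A rough sketch of posterior sampling with random-walk Metropolis-Hastings follows. The published algorithm (PolicyWalk) uses a grid-based proposal; here a Gaussian proposal and a flat prior are assumed, and `q_values` is a hypothetical helper that solves the MDP for a candidate reward, which is the expensive step:

```python
import numpy as np
from scipy.special import logsumexp

def bayesian_irl_mh(demos, q_values, n_states, beta=2.0, n_samples=5000, step=0.05):
    """Random-walk Metropolis-Hastings over reward vectors (a PolicyWalk-style sketch).

    demos: list of (state, action) pairs.  q_values(R) -> Q array of shape (S, A),
    an assumed helper that solves the MDP for a candidate reward.
    """
    def log_likelihood(R):
        Q = q_values(R)
        # Boltzmann-rational expert: P(a | s, R) = softmax_a(beta * Q(s, a))
        logp = beta * Q - logsumexp(beta * Q, axis=1, keepdims=True)
        return sum(logp[s, a] for s, a in demos)

    R = np.zeros(n_states)                    # start from the zero reward
    logp = log_likelihood(R)                  # flat (improper uniform) prior assumed
    samples = []
    for _ in range(n_samples):
        R_new = R + step * np.random.randn(n_states)
        logp_new = log_likelihood(R_new)
        if np.log(np.random.rand()) < logp_new - logp:   # accept/reject
            R, logp = R_new, logp_new
        samples.append(R.copy())
    return np.array(samples)                  # approximate posterior over rewards
```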
Generative Adversarial Imitation Learning (GAIL)
GAIL (Ho and Ermon, 2016) connects IRL to generative adversarial networks. A discriminator $D_\psi(s, a)$, which estimates the probability that a state-action pair came from the expert, distinguishes expert state-action pairs from those generated by the learned policy $\pi_\theta$, while the policy (generator) tries to fool the discriminator:

$$\min_{\pi_\theta} \max_{D_\psi} \; \mathbb{E}_{\pi_E}\big[\log D_\psi(s, a)\big] + \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_\psi(s, a)\big)\big] - \lambda H(\pi_\theta)$$

The critical insight is that $-\log\big(1 - D_\psi(s, a)\big)$ acts as a learned reward function for the policy. GAIL avoids explicitly recovering a reward, instead using the discriminator as an implicit reward signal during policy optimization. Ho and Ermon showed that GAIL is equivalent to performing maximum entropy IRL with a specific class of reward functions.
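A minimal PyTorch sketch of the discriminator side is below, under the convention that $D_\psi$ outputs the probability a pair came from the expert, so the implicit policy reward is $-\log(1 - D_\psi(s,a))$. The network size and batch interfaces are illustrative, and the policy update (TRPO in the original paper) is not shown:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a) -> logit of the probability that the pair came from the expert."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def discriminator_update(disc, opt, expert_batch, policy_batch):
    """One GAIL-style discriminator step: expert pairs labeled 1, policy pairs 0."""
    bce = nn.BCEWithLogitsLoss()
    exp_logits = disc(*expert_batch)
    pol_logits = disc(*policy_batch)
    loss = bce(exp_logits, torch.ones_like(exp_logits)) + \
           bce(pol_logits, torch.zeros_like(pol_logits))
    opt.zero_grad()
    loss.backward()
    opt.step()

def gail_reward(disc, obs, act):
    """Implicit reward for the policy: -log(1 - D(s, a)).

    Computed from logits: -log(1 - sigmoid(x)) == softplus(x), which is stable."""
    with torch.no_grad():
        return nn.functional.softplus(disc(obs, act)).squeeze(-1)
```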
Connection to RLHF
Reinforcement Learning from Human Feedback (RLHF) applies IRL principles to language models. Instead of full demonstrations, humans provide preferences between pairs of outputs. The reward model $r_\theta$ is trained via the Bradley-Terry model:

$$P(y_w \succ y_l \mid x) = \frac{\exp\big(r_\theta(x, y_w)\big)}{\exp\big(r_\theta(x, y_w)\big) + \exp\big(r_\theta(x, y_l)\big)} = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

where $y_w$ is the preferred output and $y_l$ the dispreferred one.
This preference-based IRL avoids requiring humans to demonstrate optimal behavior -- they only need to judge which of two outputs is better.
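A minimal sketch of the corresponding reward-model loss is below; the `reward_model` interface (a scalar score per prompt-completion pair) is an assumption for illustration:

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry negative log-likelihood for a batch of preference pairs.

    reward_model(prompts, completions) -> scalar score per example (assumed interface).
    """
    r_chosen = reward_model(prompts, chosen)       # r_theta(x, y_w)
    r_rejected = reward_model(prompts, rejected)   # r_theta(x, y_l)
    # P(y_w > y_l | x) = sigmoid(r_w - r_l); minimize its negative log-probability
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```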
Why It Matters
IRL is essential whenever reward functions are difficult to specify manually. Autonomous driving, robotic manipulation, and AI alignment all face this challenge: human values and preferences are complex, context-dependent, and hard to formalize. IRL provides a principled framework for learning what to optimize from human behavior. Its extension to RLHF has become the dominant approach for aligning large language models.
Key Technical Details
- Maximum entropy IRL requires solving the forward RL problem in an inner loop at each gradient step, making it computationally expensive
- GAIL eliminates the inner RL loop for reward recovery but still requires environment interaction for policy optimization
- The reward ambiguity problem means recovered rewards are only identifiable up to potential-based shaping (a direct connection to reward-shaping.md; see the sketch after this list)
- Deep IRL methods parameterize $R_\theta$ with neural networks, enabling learning from raw observations
- MaxEnt IRL produces a stochastic optimal policy, not a deterministic one -- this is a feature, not a bug, as it captures variability in expert behavior
- GAIL typically requires a large number of environment interactions during policy optimization, even when only a few demonstrations are available
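As referenced in the list above, the shaping ambiguity can be checked numerically. This sketch builds a random tabular MDP and verifies that a reward and its potential-shaped variant $R'(s,a,s') = R(s,a,s') + \gamma \Phi(s') - \Phi(s)$ induce the same optimal policy, so demonstrations alone cannot distinguish them (setup and seed are arbitrary):

```python
import numpy as np

def greedy_policy(P, R, gamma=0.9, iters=500):
    """Optimal greedy policy via value iteration; R has shape (S, A, S')."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = (P * (R + gamma * V)).sum(axis=2)   # Q(s,a) = E[R + gamma * V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)               # normalize transition probabilities
R = rng.random((S, A, S))

# Shape the reward with an arbitrary potential Phi: R' = R + gamma*Phi(s') - Phi(s)
Phi = rng.random(S) * 10.0
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]

# Both rewards induce the same optimal policy -- indistinguishable from behavior alone
assert np.array_equal(greedy_policy(P, R, gamma), greedy_policy(P, R_shaped, gamma))
```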
Common Misconceptions
- "IRL recovers the true reward function": Due to reward ambiguity, IRL can only recover a reward function that is consistent with observed behavior. Infinitely many reward functions produce the same optimal policy. The recovered reward may not match the expert's "true" internal objective.
- "IRL requires optimal demonstrations": Maximum entropy IRL and Bayesian IRL explicitly model suboptimality through the Boltzmann rationality assumption. They handle noisy demonstrations gracefully, assigning lower probability to suboptimal actions.
- "GAIL is just imitation learning": While GAIL produces a policy that imitates the expert, its mechanism is fundamentally IRL -- it learns a reward signal (the discriminator) and optimizes against it.
Connections to Other Concepts
- IRL provides the reward function that imitation-learning.md methods can then optimize, offering a two-step alternative to direct behavioral cloning
- The recovered reward can be used with reward-shaping.md techniques to accelerate learning on related tasks
- GAIL bridges IRL and imitation-learning.md, combining reward inference with policy learning
- Preference-based IRL underlies the RLHF pipeline critical to modern language model alignment
- offline-reinforcement-learning.md methods can optimize the IRL-recovered reward on the demonstration dataset itself
Further Reading
- Ng and Russell, "Algorithms for Inverse Reinforcement Learning" (2000): The foundational IRL paper establishing the problem formulation and linear programming approach.
- Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning" (2008): Introduces the maximum entropy principle to IRL, resolving reward ambiguity and providing a probabilistic framework.
- Ho and Ermon, "Generative Adversarial Imitation Learning" (2016): Connects IRL to GANs, enabling scalable IRL without explicit reward recovery.
- Ramachandran and Amir, "Bayesian Inverse Reinforcement Learning" (2007): Full posterior inference over reward functions, providing uncertainty quantification.
- Christiano et al., "Deep Reinforcement Learning from Human Preferences" (2017): Extends IRL to preference-based learning, laying groundwork for RLHF in language models.