One-Line Summary: Inverse reinforcement learning recovers the reward function that an expert is implicitly optimizing, answering "what are they trying to do?" rather than "how are they doing it?"
Prerequisites: Markov decision processes, policy optimization, maximum entropy models, Bellman equations
What Is Inverse Reinforcement Learning?
Imagine watching a master chef prepare a complex dish. You could try to copy their every move -- chop here, stir there -- but you would fail the moment the recipe changes. Instead, if you understood their underlying goals (balance flavors, achieve certain textures, present beautifully), you could cook any dish well. Inverse reinforcement learning (IRL) is this deeper form of understanding: rather than imitating actions, it infers the reward function that explains observed behavior.
Standard RL takes a reward function and finds an optimal policy. IRL inverts this: given an optimal (or near-optimal) policy demonstrated through trajectories, it recovers the reward function that makes the observed behavior optimal. This is fundamentally harder because the mapping from rewards to behavior is many-to-one -- many different reward functions can produce the same optimal policy.
How It Works
The IRL Problem
Given a set of expert demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$, where each trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is generated by an expert policy $\pi_E$, IRL seeks a reward function $R^*$ such that:

$$\pi_E \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t, a_t)\right]$$
The classic approach (Ng and Russell, 2000) formulates this as finding $R$ such that the expert's policy achieves higher value than any other policy. However, the trivial solution $R = 0$ always satisfies this, revealing the fundamental reward ambiguity problem.
Feature Matching and Reward Ambiguity
Early IRL methods represent the reward as a linear combination of features: $R(s) = w^\top \phi(s)$. The key insight is that a policy's expected return depends only on its feature expectations:

$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right] = w^\top \mu(\pi), \qquad \mu(\pi) := \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right]$$
IRL then seeks $w$ such that $w^\top \mu(\pi_E) \geq w^\top \mu(\pi)$ for all policies $\pi$. This reduces to finding a reward vector $w$ under which the expert's feature expectations dominate. But the set of valid $w$ is a convex cone, so additional constraints are needed to select a unique solution.
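To make the feature-matching view concrete, here is a minimal sketch in a discounted setting; the function names and the `phi` feature map are illustrative, not from any particular library:

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Empirical discounted feature expectations mu(pi) from sampled trajectories.

    trajectories: list of state sequences; phi: maps a state to a feature vector.
    """
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

def expert_dominates(w, expert_trajs, other_trajs, phi, gamma=0.99):
    """With a linear reward R(s) = w . phi(s), the expert's value exceeds the other
    policy's value exactly when this inequality on feature expectations holds."""
    return w @ feature_expectations(expert_trajs, phi, gamma) >= \
           w @ feature_expectations(other_trajs, phi, gamma)
```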
Maximum Entropy IRL
Maximum entropy IRL (Ziebart et al., 2008) resolves the ambiguity elegantly by assuming the expert follows a Boltzmann-rational policy where the probability of a trajectory is exponential in its return:

$$P(\tau \mid \theta) = \frac{1}{Z(\theta)} \exp\big(R_\theta(\tau)\big), \qquad R_\theta(\tau) = \sum_{t} R_\theta(s_t)$$

where $Z(\theta) = \sum_{\tau} \exp(R_\theta(\tau))$ is the partition function. The reward parameters $\theta$ are found by maximum likelihood:

$$\theta^* = \arg\max_{\theta} \; \mathcal{L}(\theta) = \arg\max_{\theta} \sum_{\tau \in \mathcal{D}} \log P(\tau \mid \theta)$$

The gradient takes an intuitive form -- matching feature expectations between the expert and the model:

$$\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \mathcal{D}}\big[\phi(\tau)\big] - \mathbb{E}_{\tau \sim P(\tau \mid \theta)}\big[\phi(\tau)\big], \qquad \phi(\tau) = \sum_t \phi(s_t)$$
This means: adjust the reward so that the expected features under the induced policy match the expert's features.
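The following is a minimal sketch of one such gradient step in a small tabular, finite-horizon MDP, assuming a state-only linear reward $R_\theta(s) = \theta^\top \phi(s)$; the function name and array layout are illustrative assumptions. The backward pass solves the soft forward problem, the forward pass computes expected state visitations, and the gradient is the difference in feature counts:

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl_step(P, feats, mu_E, theta, p0, horizon, lr=0.1):
    """One gradient-ascent step of MaxEnt IRL in a tabular, finite-horizon MDP.

    P: transitions, shape (S, A, S).  feats: state features, shape (S, d).
    mu_E: empirical expert feature counts (summed over time, averaged over
    trajectories), shape (d,).  p0: initial-state distribution, shape (S,).
    """
    S, A, _ = P.shape
    r = feats @ theta                        # reward R_theta(s) = theta . phi(s)

    # Backward pass: soft value iteration gives the Boltzmann-rational policy.
    V = np.zeros(S)
    for _ in range(horizon):
        Q = r[:, None] + P @ V               # (S, A): r(s) + E_{s'}[V(s')]
        V = logsumexp(Q, axis=1)             # soft maximum over actions
    pi = np.exp(Q - V[:, None])              # pi(a|s) proportional to exp(Q(s,a))

    # Forward pass: expected state-visitation counts under the current policy.
    d = p0.copy()
    D = np.zeros(S)
    for _ in range(horizon):
        D += d
        d = np.einsum('s,sa,sax->x', d, pi, P)

    # Gradient = expert feature counts - model feature counts; ascend.
    grad = mu_E - feats.T @ D
    return theta + lr * grad
```

Each call re-solves the forward RL problem, which is exactly the inner-loop cost noted under Key Technical Details below.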
Bayesian IRL
Bayesian IRL (Ramachandran and Amir, 2007) maintains a posterior distribution over reward functions:

$$P(R \mid \mathcal{D}) \propto P(\mathcal{D} \mid R)\, P(R)$$

where the likelihood assumes the expert is Boltzmann-rational with temperature parameter $\beta$: $P(a \mid s, R) \propto \exp\big(\beta\, Q^*_R(s, a)\big)$. The posterior captures the full set of reward functions consistent with the demonstrations, naturally handling ambiguity. However, sampling from this posterior is computationally expensive.
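A rough sketch of posterior sampling with random-walk Metropolis-Hastings follows. The published algorithm (PolicyWalk) uses a grid-based proposal; here a Gaussian proposal and a flat prior are assumed, and `q_values` is a hypothetical helper that solves the MDP for a candidate reward, which is the expensive step:

```python
import numpy as np
from scipy.special import logsumexp

def bayesian_irl_mh(demos, q_values, n_states, beta=2.0, n_samples=5000, step=0.05):
    """Random-walk Metropolis-Hastings over reward vectors (a PolicyWalk-style sketch).

    demos: list of (state, action) pairs.  q_values(R) -> Q array of shape (S, A),
    an assumed helper that solves the MDP for a candidate reward.
    """
    def log_likelihood(R):
        Q = q_values(R)
        # Boltzmann-rational expert: P(a | s, R) = softmax_a(beta * Q(s, a))
        logp = beta * Q - logsumexp(beta * Q, axis=1, keepdims=True)
        return sum(logp[s, a] for s, a in demos)

    R = np.zeros(n_states)                    # start from the zero reward
    logp = log_likelihood(R)                  # flat (improper uniform) prior assumed
    samples = []
    for _ in range(n_samples):
        R_new = R + step * np.random.randn(n_states)
        logp_new = log_likelihood(R_new)
        if np.log(np.random.rand()) < logp_new - logp:   # accept/reject
            R, logp = R_new, logp_new
        samples.append(R.copy())
    return np.array(samples)                  # approximate posterior over rewards
```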
Generative Adversarial Imitation Learning (GAIL)
GAIL (Ho and Ermon, 2016) connects IRL to generative adversarial networks. A discriminator $D_\psi(s, a)$, which estimates the probability that a state-action pair came from the expert, distinguishes expert state-action pairs from those generated by the learned policy $\pi_\theta$, while the policy (generator) tries to fool the discriminator:

$$\min_{\pi_\theta} \max_{D_\psi} \; \mathbb{E}_{\pi_E}\big[\log D_\psi(s, a)\big] + \mathbb{E}_{\pi_\theta}\big[\log\big(1 - D_\psi(s, a)\big)\big] - \lambda H(\pi_\theta)$$

The critical insight is that $-\log\big(1 - D_\psi(s, a)\big)$ acts as a learned reward function for the policy. GAIL avoids explicitly recovering a reward, instead using the discriminator as an implicit reward signal during policy optimization. Ho and Ermon showed that GAIL is equivalent to performing maximum entropy IRL with a specific class of reward functions.
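A minimal PyTorch sketch of the discriminator side is below, under the convention that $D_\psi$ outputs the probability a pair came from the expert, so the implicit policy reward is $-\log(1 - D_\psi(s,a))$. The network size and batch interfaces are illustrative, and the policy update (TRPO in the original paper) is not shown:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a) -> logit of the probability that the pair came from the expert."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def discriminator_update(disc, opt, expert_batch, policy_batch):
    """One GAIL-style discriminator step: expert pairs labeled 1, policy pairs 0."""
    bce = nn.BCEWithLogitsLoss()
    exp_logits = disc(*expert_batch)
    pol_logits = disc(*policy_batch)
    loss = bce(exp_logits, torch.ones_like(exp_logits)) + \
           bce(pol_logits, torch.zeros_like(pol_logits))
    opt.zero_grad()
    loss.backward()
    opt.step()

def gail_reward(disc, obs, act):
    """Implicit reward for the policy: -log(1 - D(s, a)).

    Computed from logits: -log(1 - sigmoid(x)) == softplus(x), which is stable."""
    with torch.no_grad():
        return nn.functional.softplus(disc(obs, act)).squeeze(-1)
```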
Connection to RLHF
Reinforcement Learning from Human Feedback (RLHF) applies IRL principles to language models. Instead of full demonstrations, humans provide preferences between pairs of outputs. The reward model $r_\theta$ is trained via the Bradley-Terry model:

$$P(y_w \succ y_l \mid x) = \frac{\exp\big(r_\theta(x, y_w)\big)}{\exp\big(r_\theta(x, y_w)\big) + \exp\big(r_\theta(x, y_l)\big)} = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

where $y_w$ is the preferred output and $y_l$ the dispreferred one.
This preference-based IRL avoids requiring humans to demonstrate optimal behavior -- they only need to judge which of two outputs is better.
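A minimal sketch of the corresponding reward-model loss is below; the `reward_model` interface (a scalar score per prompt-completion pair) is an assumption for illustration:

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry negative log-likelihood for a batch of preference pairs.

    reward_model(prompts, completions) -> scalar score per example (assumed interface).
    """
    r_chosen = reward_model(prompts, chosen)       # r_theta(x, y_w)
    r_rejected = reward_model(prompts, rejected)   # r_theta(x, y_l)
    # P(y_w > y_l | x) = sigmoid(r_w - r_l); minimize its negative log-probability
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```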
Why It Matters
IRL is essential whenever reward functions are difficult to specify manually. Autonomous driving, robotic manipulation, and AI alignment all face this challenge: human values and preferences are complex, context-dependent, and hard to formalize. IRL provides a principled framework for learning what to optimize from human behavior. Its extension to RLHF has become the dominant approach for aligning large language models.
Key Technical Details
- Maximum entropy IRL requires solving the forward RL problem in an inner loop at each gradient step, making it computationally expensive
- GAIL eliminates the inner RL loop for reward recovery but still requires environment interaction for policy optimization
- The reward ambiguity problem means recovered rewards are only identifiable up to potential-based shaping (a direct connection to reward-shaping.md; see the sketch after this list)
- Deep IRL methods parameterize $R_\theta$ with neural networks, enabling learning from raw observations
- MaxEnt IRL produces a stochastic optimal policy, not a deterministic one -- this is a feature, not a bug, as it captures variability in expert behavior
- GAIL typically requires a large number of environment interactions during policy optimization, even when only a few demonstrations are available
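As referenced in the list above, the shaping ambiguity can be checked numerically. This sketch builds a random tabular MDP and verifies that a reward and its potential-shaped variant $R'(s,a,s') = R(s,a,s') + \gamma \Phi(s') - \Phi(s)$ induce the same optimal policy, so demonstrations alone cannot distinguish them (setup and seed are arbitrary):

```python
import numpy as np

def greedy_policy(P, R, gamma=0.9, iters=500):
    """Optimal greedy policy via value iteration; R has shape (S, A, S')."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = (P * (R + gamma * V)).sum(axis=2)   # Q(s,a) = E[R + gamma * V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)               # normalize transition probabilities
R = rng.random((S, A, S))

# Shape the reward with an arbitrary potential Phi: R' = R + gamma*Phi(s') - Phi(s)
Phi = rng.random(S) * 10.0
R_shaped = R + gamma * Phi[None, None, :] - Phi[:, None, None]

# Both rewards induce the same optimal policy -- indistinguishable from behavior alone
assert np.array_equal(greedy_policy(P, R, gamma), greedy_policy(P, R_shaped, gamma))
```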
Common Misconceptions
- "IRL recovers the true reward function": Due to reward ambiguity, IRL can only recover a reward function that is consistent with observed behavior. Infinitely many reward functions produce the same optimal policy. The recovered reward may not match the expert's "true" internal objective.
- "IRL requires optimal demonstrations": Maximum entropy IRL and Bayesian IRL explicitly model suboptimality through the Boltzmann rationality assumption. They handle noisy demonstrations gracefully, assigning lower probability to suboptimal actions.
- "GAIL is just imitation learning": While GAIL produces a policy that imitates the expert, its mechanism is fundamentally IRL -- it learns a reward signal (the discriminator) and optimizes against it.
Connections to Other Concepts
- IRL provides the reward function that imitation-learning.md methods can then optimize, offering a two-step alternative to direct behavioral cloning
- The recovered reward can be used with reward-shaping.md techniques to accelerate learning on related tasks
- GAIL bridges IRL and imitation-learning.md, combining reward inference with policy learning
- Preference-based IRL underlies the RLHF pipeline critical to modern language model alignment
- offline-reinforcement-learning.md methods can optimize the IRL-recovered reward on the demonstration dataset itself
Further Reading
- Ng and Russell, "Algorithms for Inverse Reinforcement Learning" (2000): The foundational IRL paper establishing the problem formulation and linear programming approach.
- Ziebart et al., "Maximum Entropy Inverse Reinforcement Learning" (2008): Introduces the maximum entropy principle to IRL, resolving reward ambiguity and providing a probabilistic framework.
- Ho and Ermon, "Generative Adversarial Imitation Learning" (2016): Connects IRL to GANs, enabling scalable IRL without explicit reward recovery.
- Ramachandran and Amir, "Bayesian Inverse Reinforcement Learning" (2007): Full posterior inference over reward functions, providing uncertainty quantification.
- Christiano et al., "Deep Reinforcement Learning from Human Preferences" (2017): Extends IRL to preference-based learning, laying groundwork for RLHF in language models.