One-Line Summary: The recursive decomposition of value into immediate reward plus discounted future value -- the fundamental identity of RL.

Prerequisites: markov-decision-processes.md, return-and-discount-factor.md, policies.md, value-functions.md.

What Are the Bellman Equations?

Imagine planning a cross-country road trip. You don't need to evaluate the entire route at once. Instead, you reason: "The value of being in Denver equals the pleasure of driving the next leg to Salt Lake City, plus the value of being in Salt Lake City." You decompose a global evaluation into a local step plus the value of where you end up. That recursive insight -- the value of now equals the reward of the next step plus the discounted value of later -- is the Bellman equation.

Named after Richard Bellman (1957), these equations are the mathematical backbone of nearly every RL algorithm. They convert the problem of evaluating an infinite sum of future rewards into a system of simultaneous equations, one for each state.

How It Works

Bellman Expectation Equation for $V^\pi$

Starting from the definition of the state-value function, $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, and using the recursive structure of the return ($G_t = R_{t+1} + \gamma G_{t+1}$):

$$V^\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \,\middle|\, S_t = s\right]$$

Expanding the expectation over actions and next states:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]$$

This says: the value of state $s$ under policy $\pi$ equals the expected immediate reward plus the discounted value of the next state, averaged over the policy's action choices and the environment's stochastic transitions.
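
As a concrete illustration, here is a minimal NumPy sketch of one such expectation backup on a hypothetical two-state, two-action MDP (the arrays `P`, `R`, and `pi` below are made up for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probabilities, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],   # pi[s, a] = probability of taking action a in state s
               [0.1, 0.9]])
gamma = 0.9

def bellman_expectation_backup_v(V):
    """Right-hand side of the Bellman expectation equation for V^pi, evaluated at V."""
    q_term = R + gamma * P @ V            # shape (S, A): r + gamma * sum_s' P(s'|s,a) V(s')
    return np.sum(pi * q_term, axis=1)    # average over actions under pi

V = np.zeros(2)
print(bellman_expectation_backup_v(V))   # one backup starting from V = 0
```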

Bellman Expectation Equation for $Q^\pi$

Similarly, for the action-value function:

$$Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$$

This decomposes $Q^\pi(s, a)$ into the immediate reward plus the discounted Q-value of the next state-action pair, weighted by the policy at $s'$.
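
A matching sketch for the $Q^\pi$ version, reusing the same kind of hypothetical `P`, `R`, and `pi` arrays (illustrative, not from any particular library):

```python
import numpy as np

# Hypothetical MDP and policy, as in the previous sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])
gamma = 0.9

def bellman_expectation_backup_q(Q):
    """Right-hand side of the Bellman expectation equation for Q^pi, evaluated at Q."""
    v_next = np.sum(pi * Q, axis=1)   # expected value of each next state under pi
    return R + gamma * P @ v_next     # shape (S, A)

Q = np.zeros((2, 2))
print(bellman_expectation_backup_q(Q))
```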

The Four Bellman Equations (Summary)

| Equation | Recursive Form |
| --- | --- |
| $V^\pi$ (expectation) | $V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^\pi(s')\right]$ |
| $Q^\pi$ (expectation) | $Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$ |
| $V^*$ (optimality) | $V^*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]$ |
| $Q^*$ (optimality) | $Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]$ |

Bellman Optimality Equations

The Bellman optimality equation for $V^*$ replaces the policy average with a maximization:

$$V^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^*(s')\right]$$

For $Q^*$:

$$Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$

The key difference: the expectation equations use $\pi(a \mid s)$ to weight actions, while the optimality equations use $\max_a$. The optimality equations are nonlinear (due to the $\max$) and generally cannot be solved in closed form.
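
The corresponding backups, with the policy average replaced by a max (again a minimal sketch on a hypothetical MDP):

```python
import numpy as np

# Same hypothetical MDP arrays as in the earlier sketches.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman_optimality_backup_v(V):
    """Right-hand side of the Bellman optimality equation for V^*."""
    return np.max(R + gamma * P @ V, axis=1)      # best action in each state

def bellman_optimality_backup_q(Q):
    """Right-hand side of the Bellman optimality equation for Q^*."""
    return R + gamma * P @ np.max(Q, axis=1)      # greedy value of each next state

print(bellman_optimality_backup_v(np.zeros(2)))
print(bellman_optimality_backup_q(np.zeros((2, 2))))
```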

Derivation Sketch

Starting from $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$:

  1. Substitute $G_t = R_{t+1} + \gamma G_{t+1}$
  2. Apply the law of total expectation, conditioning on $A_t$ and $S_{t+1}$
  3. Use the Markov property: $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = V^\pi(s')$
  4. The result is the Bellman expectation equation
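
Written out, a sketch of steps 1-4 in standard notation (nothing beyond the definitions above):

$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left(r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']\right) \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left(r + \gamma V^\pi(s')\right).
\end{aligned}
$$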

For the optimality equation, replace the policy average with $\max_a$, reflecting that the optimal policy always selects the best action.

Matrix Form (Finite MDPs)

For a finite MDP with $n$ states, the Bellman expectation equation for $V^\pi$ is a linear system:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

where $V^\pi \in \mathbb{R}^n$ is the vector of state values, $R^\pi \in \mathbb{R}^n$ is the expected reward vector, and $P^\pi \in \mathbb{R}^{n \times n}$ is the state transition matrix under $\pi$. The closed-form solution is:

$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$

This requires $O(n^3)$ computation for the matrix inverse, which is feasible for small MDPs but intractable for large ones (e.g., $10^6$ states would require on the order of $10^{18}$ operations).
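
A minimal sketch of the closed-form solution, assuming a hypothetical two-state policy-averaged transition matrix `P_pi` and reward vector `R_pi` (the numbers are illustrative):

```python
import numpy as np

# Hypothetical policy-averaged quantities:
# P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a),  R_pi[s] = sum_a pi(a|s) R(s, a)
P_pi = np.array([[0.55, 0.45],
                 [0.45, 0.55]])
R_pi = np.array([0.5, 1.8])
gamma = 0.9

# V^pi = (I - gamma * P_pi)^{-1} R_pi; solve() avoids forming the inverse explicitly.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi)
```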

The Bellman Operator

Define the Bellman operator $T^\pi$ for policy evaluation:

$$(T^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$$

And the Bellman optimality operator $T^*$:

$$(T^* V)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$$

Both operators are $\gamma$-contractions in the sup norm $\|\cdot\|_\infty$:

$$\|T V_1 - T V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$

By the Banach fixed-point theorem, each operator has a unique fixed point, and repeated application converges to it. This is the theoretical foundation for:

  • Policy evaluation (iterative application of $T^\pi$) converging to $V^\pi$.
  • Value iteration (iterative application of $T^*$) converging to $V^*$ (see the sketch below).
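
A minimal sketch of both operators and the fixed-point iteration, on the same kind of hypothetical MDP as above (array names are illustrative):

```python
import numpy as np

# Hypothetical MDP (P[s, a, s'], R[s, a]) and fixed policy pi[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])
gamma = 0.9

def T_pi(V):
    """Bellman expectation operator for pi."""
    return np.sum(pi * (R + gamma * P @ V), axis=1)

def T_star(V):
    """Bellman optimality operator."""
    return np.max(R + gamma * P @ V, axis=1)

def fixed_point(T, n_states, tol=1e-10):
    """Iterate V <- T(V); the contraction property guarantees convergence."""
    V = np.zeros(n_states)
    while True:
        V_new = T(V)
        if np.max(np.abs(V_new - V)) < tol:   # sup-norm stopping criterion
            return V_new
        V = V_new

print(fixed_point(T_pi, 2))    # iterative policy evaluation -> approx V^pi
print(fixed_point(T_star, 2))  # value iteration -> approx V^*
```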

Algorithms Built on Bellman Equations

| Algorithm | Bellman Equation Used | Approach |
| --- | --- | --- |
| Policy Evaluation | $V^\pi$ expectation | Iterative |
| Value Iteration | $V^*$ optimality | Iterative |
| Q-Learning | $Q^*$ optimality | Sample-based |
| SARSA | $Q^\pi$ expectation | Sample-based |
| TD(0) | $V^\pi$ expectation | Sample-based |

Why It Matters

The Bellman equations transform the RL problem from "evaluate an infinite sum of future rewards" to "solve a system of equations relating neighboring states." This recursive structure enables:

  • Dynamic programming algorithms that solve MDPs exactly when the model is known.
  • Temporal-difference learning that bootstraps (estimates values from other estimates), enabling online, incremental learning.
  • Convergence guarantees through the contraction mapping property.

Without Bellman equations, RL would require evaluating complete trajectories for every value estimate -- the Bellman decomposition makes tractable learning possible.
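
To make the bootstrapping point concrete, here is a minimal sketch of a single sample-based TD(0) update (the function and variable names are illustrative, not from any particular library):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update: move V[s] toward the sampled Bellman target r + gamma * V[s_next]."""
    td_target = r + gamma * V[s_next]     # bootstrapped estimate of the return from s
    V[s] += alpha * (td_target - V[s])    # incremental move toward the target
    return V

# Example: a dict-based value table updated from one observed transition (s=0, r=1.0, s'=1).
V = {0: 0.0, 1: 0.5}
print(td0_update(V, s=0, r=1.0, s_next=1))
```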

Key Technical Details

  • Value iteration converges at rate $\gamma$: after $k$ iterations, $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$. For $\gamma$ close to 1, reaching $\epsilon$-accuracy takes on the order of $\frac{\log(1/\epsilon)}{1 - \gamma}$ iterations.
  • Policy iteration alternates between solving for $V^\pi$ exactly (policy evaluation) and updating the policy greedily with respect to $V^\pi$ (policy improvement). It converges in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of deterministic policies), but in practice converges much faster.
  • The deadly triad: Combining bootstrapping (Bellman equations), function approximation, and off-policy learning can cause divergence. This fundamental tension motivates algorithms like fitted Q-iteration, gradient TD, and emphatic TD.
  • Q-learning's update rule is a stochastic approximation to the Bellman optimality equation: $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$ (see the sketch below).
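
A minimal sketch of that update as code (the tabular `Q` array and names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update: a sample-based backup toward the Bellman optimality target."""
    td_target = r + gamma * np.max(Q[s_next])   # max over next actions, from the optimality equation
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: a 2-state, 2-action table updated from one observed transition.
Q = np.zeros((2, 2))
print(q_learning_update(Q, s=0, a=1, r=1.0, s_next=1))
```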

Common Misconceptions

"The Bellman equation is an update rule." The Bellman equation is an identity -- a condition that the true value function must satisfy. Algorithms like TD learning and value iteration use update rules inspired by the Bellman equation, but the equation itself is a mathematical relationship, not a procedure.

"You need the transition model to use Bellman equations." Model-free methods (TD, Q-learning, SARSA) use sample-based Bellman updates, replacing the expectation over transitions with single observed transitions. The model is not required.

"Bellman optimality equations can be solved directly." The operator makes them nonlinear. Unlike the expectation equations, there is no closed-form solution. They must be solved iteratively (value iteration) or via linear programming.

Connections to Other Concepts

  • value-functions.md -- Bellman equations define the recursive structure of value functions.
  • return-and-discount-factor.md -- The recursive decomposition is the starting point.
  • policies.md -- The expectation equations depend on $\pi$; the optimality equations define $\pi^*$.
  • markov-decision-processes.md -- Bellman equations exploit the Markov property.
  • exploration-vs-exploitation.md -- Q-learning (based on Bellman optimality) is combined with exploration strategies.

Further Reading

  • Bellman (1957) -- Dynamic Programming. The origin of the principle of optimality and the Bellman equation.
  • Sutton & Barto (2018) -- Reinforcement Learning: An Introduction, Chapters 3-4. Derivation and use of Bellman equations in dynamic programming.
  • Bertsekas (2012) -- Dynamic Programming and Optimal Control (4th edition). Rigorous treatment of Bellman equations, contraction mappings, and convergence theory.
  • Watkins & Dayan (1992) -- "Q-learning." Machine Learning, 8. Proves convergence of Q-learning to the Bellman optimality fixed point.
  • Tsitsiklis & Van Roy (1997) -- "An analysis of temporal-difference learning with function approximation." IEEE Transactions on Automatic Control. Foundational analysis of TD learning and Bellman equation approximation.