One-Line Summary: Computing optimal policies via iterative Bellman updates when the full environment model is known -- the theoretical foundation of reinforcement learning.

Prerequisites: markov-decision-processes.md, value-functions.md, bellman-equations.md, policies.md

What Is Dynamic Programming?

Imagine you have a complete map of a city -- every road, every traffic delay, every shortcut. With this map in hand, you don't need to wander around discovering routes. You can sit at your desk and compute the best route from any location to any destination by working backward from the goal. Dynamic programming (DP) in reinforcement learning works exactly this way: given perfect knowledge of the environment's transition dynamics $p(s', r \mid s, a)$, it computes optimal policies through systematic, iterative computation rather than trial-and-error interaction.

DP methods are planning algorithms. They require a complete model of the MDP (all transition probabilities and rewards), which makes them impractical for most real-world problems. Yet they are indispensable as theoretical foundations -- virtually every RL algorithm can be understood as an approximation to one of the two core DP algorithms: policy iteration or value iteration.

The term "dynamic programming" was coined by Richard Bellman in the 1950s. The "programming" refers to optimization (as in linear programming), and "dynamic" refers to the sequential, multi-stage nature of the problems.

How It Works

Policy Evaluation (Prediction)

Given a fixed policy $\pi$, policy evaluation computes the state-value function $v_\pi$ by iteratively applying the Bellman expectation equation as an update rule:

$$V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_k(s') \right]$$

Starting from an arbitrary $V_0$, each sweep over all states produces a better approximation. The sequence $V_0, V_1, V_2, \ldots$ converges to $v_\pi$ as $k \to \infty$. In practice, convergence is declared when $\max_s |V_{k+1}(s) - V_k(s)| < \theta$ for some small threshold $\theta$.
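
A minimal sketch of this update loop in Python, assuming (purely for illustration) that the model is given as a nested dict `P[s][a]` of `(probability, next_state, reward)` tuples and the policy as action probabilities `policy[s][a]`; the names and data layout are choices made here, not part of DP itself:

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a finite MDP with a known model.

    Assumed layout: P[s][a] -> list of (prob, next_state, reward) tuples,
    policy[s][a] -> probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}                      # arbitrary V_0
    while True:
        delta = 0.0
        for s in P:
            # Bellman expectation backup for state s
            v_new = sum(policy[s][a] * prob * (r + gamma * V[s2])
                        for a in P[s]
                        for prob, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                         # sweep updates V in place
        if delta < theta:                        # largest change below threshold
            return V
```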

Recommended visual: Gridworld value function converging over successive sweeps. Source: Sutton & Barto, Reinforcement Learning: An Introduction, Figure 4.1.

Policy Improvement

Once we have $v_\pi$, we can construct a strictly better (or equal) policy $\pi'$ by acting greedily with respect to the current value function:

$$\pi'(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

The policy improvement theorem guarantees that if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all states $s$, then $v_{\pi'}(s) \ge v_\pi(s)$ for all states $s$. If $v_{\pi'} = v_\pi$, then $\pi$ is already optimal.
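
Acting greedily is a one-step lookahead through the model. A brief sketch, reusing the same assumed `P[s][a]` representation as above:

```python
def policy_improvement(P, V, gamma=0.9):
    """Greedy policy improvement via one-step lookahead on V.

    Assumes P[s][a] -> [(prob, next_state, reward), ...]; returns a
    deterministic policy as a dict mapping each state to its greedy action.
    """
    greedy = {}
    for s in P:
        # q(s, a) for every available action, computed from the model
        q = {a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
             for a in P[s]}
        greedy[s] = max(q, key=q.get)            # argmax_a q(s, a)
    return greedy
```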

Policy Iteration

Policy iteration alternates between evaluation and improvement:

  1. Initialize $\pi_0$ arbitrarily.
  2. Policy Evaluation: Compute $v_{\pi_k}$ (iterative sweeps until convergence).
  3. Policy Improvement: Derive $\pi_{k+1}$ by acting greedily w.r.t. $v_{\pi_k}$.
  4. If $\pi_{k+1} = \pi_k$, stop (optimal policy found). Otherwise, set $k \leftarrow k + 1$ and go to step 2.

Since a finite MDP has a finite number of deterministic policies, and each iteration produces a strictly better policy (unless already optimal), policy iteration terminates in a finite number of steps. In practice, convergence is remarkably fast -- often in fewer than 10 iterations even for large state spaces.
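
Putting the two phases together, a compact sketch of the loop above, reusing the `policy_evaluation` and `policy_improvement` sketches from earlier; deterministic policies are stored here as one-hot action distributions so the evaluator can consume them directly (an implementation choice, not a requirement):

```python
def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Arbitrary initial policy: always choose the first listed action
    policy = {s: {a: (1.0 if i == 0 else 0.0)
                  for i, a in enumerate(P[s])}
              for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma, theta)   # step 2: evaluate
        greedy = policy_improvement(P, V, gamma)         # step 3: improve
        improved = {s: {a: (1.0 if a == greedy[s] else 0.0) for a in P[s]}
                    for s in P}
        if improved == policy:                           # step 4: policy stable
            return greedy, V
        policy = improved
```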

Value Iteration

Value iteration collapses evaluation and improvement into a single update by applying the Bellman optimality equation directly:

$$V_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_k(s') \right]$$

This is equivalent to performing a single sweep of policy evaluation followed by policy improvement at every step. The optimal policy is extracted at the end:

$$\pi_*(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]$$

Value iteration converges to $v_*$ under the same conditions as policy evaluation, since the Bellman optimality operator is a contraction mapping with factor $\gamma$.
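
A corresponding Python sketch, again under the assumed `P[s][a]` model layout used in the earlier examples:

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Repeated Bellman optimality backups, then one greedy policy extraction."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions instead of an
            # expectation under a fixed policy
            v_new = max(sum(prob * (r + gamma * V[s2])
                            for prob, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy from the (approximately) optimal values
    policy = {s: max(P[s], key=lambda a: sum(prob * (r + gamma * V[s2])
                                             for prob, s2, r in P[s][a]))
              for s in P}
    return policy, V
```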

Convergence Guarantees

Both algorithms rest on the contraction mapping theorem. The Bellman operator $T$ satisfies:

$$\| T V - T V' \|_\infty \le \gamma \, \| V - V' \|_\infty$$

This guarantees a unique fixed point and geometric convergence at rate $\gamma$. After $k$ iterations, the error is bounded by $\gamma^k \, \| V_0 - v_* \|_\infty$.
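
As an illustration with assumed values: to shrink the initial error by a factor of $10^{-2}$ with $\gamma = 0.9$, we need $\gamma^k \le 10^{-2}$, i.e.

$$k \ge \frac{\ln 10^{-2}}{\ln 0.9} \approx 44,$$

so roughly 44 sweeps suffice. With $\gamma = 0.99$ the same accuracy takes on the order of 460 sweeps, which is why convergence slows sharply as $\gamma \to 1$.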

Why It Matters

DP provides the correctness benchmark for all of reinforcement learning. When we say Q-learning "converges to the optimal policy," we mean it converges to the same solution DP would compute -- but without requiring the model. Every model-free algorithm (MC, TD, Q-learning, SARSA) is, in a deep sense, trying to approximate DP using sampled experience.

DP methods are also directly practical for moderate-scale problems where the model is known: inventory management, queueing systems, and MDPs with up to millions of states can be solved exactly with modern implementations.

Key Technical Details

  • Computational complexity: Each sweep of policy evaluation or value iteration costs $O(|\mathcal{S}|^2 |\mathcal{A}|)$ -- iterating over all states, all actions, and all successor states.
  • Policy iteration typically converges in very few outer iterations (often fewer than 10), but each policy evaluation step may require many sweeps.
  • Value iteration requires more total sweeps but each sweep is cheaper (no inner convergence loop).
  • Asynchronous DP updates states in any order, enabling prioritized sweeping of states with the largest Bellman residuals.
  • Discount factor $\gamma < 1$ is required for convergence guarantees in infinite-horizon problems. For episodic tasks with absorbing terminal states, $\gamma = 1$ is also valid.
  • In-place updates (updating $V(s)$ immediately rather than maintaining separate $V_k$ and $V_{k+1}$ arrays) often converge faster in practice; see the sketch after this list.
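
The difference between synchronous (two-array) and in-place sweeps can be made concrete with a small sketch, using the same assumed `P[s][a]` model layout as the earlier examples:

```python
def sweep_synchronous(P, V, gamma=0.9):
    """One synchronous sweep: every backup reads the old V and writes a new dict."""
    return {s: max(sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                   for a in P[s])
            for s in P}

def sweep_in_place(P, V, gamma=0.9):
    """One in-place (Gauss-Seidel) sweep: later backups in the same sweep
    immediately see the values written by earlier backups."""
    for s in P:
        V[s] = max(sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                   for a in P[s])
    return V
```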

Common Misconceptions

  • "DP requires visiting states through interaction." No. DP uses the model to compute expectations analytically. It never generates a single trajectory. This is what makes it a planning method, not a learning method.
  • "Value iteration is always faster than policy iteration." Not necessarily. Policy iteration often converges in fewer total computation steps because each policy evaluation phase can use warm-starting. The comparison depends on the specific MDP structure.
  • "DP is only of theoretical interest." DP is used in production systems for operations research, logistics, and finance -- anywhere the model is known and the state space is tractable.

Connections to Other Concepts

  • bellman-equations.md -- DP directly operationalizes the Bellman equations as iterative update rules.
  • value-functions.md -- DP computes the exact value functions that model-free methods estimate from samples.
  • monte-carlo-methods.md and temporal-difference-learning.md -- These approximate DP when the model is unknown, using sampled returns or bootstrapped estimates.
  • q-learning.md -- Asynchronous, sample-based approximation to value iteration applied to Q-values.
  • dyna-architecture.md -- Combines model-free learning with model-based DP-style planning, bridging both paradigms.

Further Reading

  1. "Dynamic Programming" (Bellman, 1957) -- The foundational monograph that introduced DP and the principle of optimality for sequential decision problems.
  2. "Reinforcement Learning: An Introduction," Chapter 4 (Sutton & Barto, 2018) -- The definitive textbook treatment of DP for RL, with gridworld examples and convergence analysis.
  3. "Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time" (Moore & Atkeson, 1993) -- Introduces asynchronous DP with intelligent state ordering, dramatically improving convergence speed.
  4. "Markov Decision Processes: Discrete Stochastic Dynamic Programming" (Puterman, 1994) -- The comprehensive mathematical reference for MDP theory and DP algorithms.