One-Line Summary: Expected future return from a state (V) or state-action pair (Q) -- the backbone of most RL algorithms.
Prerequisites: markov-decision-processes.md, return-and-discount-factor.md, policies.md.
What Are Value Functions?
Imagine you are house-hunting. Some neighborhoods are objectively more desirable: good schools, low crime, close to transit. The "value" of being in a neighborhood captures not just what you experience today, but all the future benefits of living there. In RL, a value function does exactly this -- it estimates the total future reward an agent can expect from a given situation, accounting for everything that will happen from that point onward.
Value functions are the agent's internal estimate of "how good is it to be here?" They compress the infinite complexity of future trajectories into a single number per state, enabling the agent to make locally informed decisions with globally optimal consequences.
How It Works
State-Value Function
The state-value function $V^\pi(s)$ under policy $\pi$ gives the expected return starting from state $s$ and following $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]$$
This answers: "How good is it to be in state $s$ if I follow policy $\pi$?"
Action-Value Function
The action-value function $Q^\pi(s, a)$ (or Q-function) under policy $\pi$ gives the expected return starting from state $s$, taking action $a$, and following $\pi$ thereafter:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$
This answers: "How good is it to take action $a$ in state $s$ and then follow policy $\pi$?"
Relationship Between V and Q
The two value functions are intimately connected:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \, Q^\pi(s, a)$$

The state-value is the policy-weighted average of the action-values. Conversely:

$$Q^\pi(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^\pi(s') \right]$$
The action-value equals the immediate reward plus the discounted value of the next state, averaged over transition uncertainty.
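The interconversion is easy to check numerically. Below is a minimal sketch on a toy 2-state, 2-action MDP; the transition tensor `P`, reward tensor `R`, and policy `pi` are all invented for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers invented for illustration).
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])   # R[s, a, s'] = r(s, a, s')
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])                # pi[s, a] = pi(a | s)

# Policy evaluation: solve the linear system (I - gamma * P_pi) V = r_pi.
P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
r_pi = np.einsum("sa,sat,sat->s", pi, P, R)  # expected one-step reward under pi
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

# Q from V: immediate expected reward plus discounted next-state value.
Q = np.einsum("sat,sat->sa", P, R) + gamma * np.einsum("sat,t->sa", P, V)

# V from Q: the policy-weighted average of action-values recovers V.
assert np.allclose(V, (pi * Q).sum(axis=1))
```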
The Advantage Function
The advantage function $A^\pi(s, a)$ measures how much better action $a$ is compared to the average action under $\pi$:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
Key properties:
- $\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[A^\pi(s, a)\right] = 0$ (the advantage is zero on average; the sketch after this list checks this numerically).
- $A^\pi(s, a) > 0$ means action $a$ is better than the policy's average.
- The advantage function is central to policy gradient methods (A2C, A3C, PPO, GAE).
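Continuing the tabular sketch above (reusing `Q`, `V`, and `pi`):

```python
# Advantage: how much better each action is than the policy's average.
A = Q - V[:, None]                     # A[s, a] = Q(s, a) - V(s)

# Zero-mean property: the policy-weighted advantage vanishes in every state.
assert np.allclose((pi * A).sum(axis=1), 0.0)
```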
Optimal Value Functions
The optimal state-value function is the maximum value achievable from state $s$ under any policy:

$$V^*(s) = \max_\pi V^\pi(s)$$
The optimal action-value function is the maximum expected return achievable starting from the state-action pair $(s, a)$:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$
Once $Q^*$ is known, the optimal policy is immediately available:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
This is why Q-learning and DQN focus on learning $Q^*$ -- the optimal policy falls out as a byproduct.
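In the tabular case, one standard way to compute $Q^*$ is value iteration. A minimal sketch on the toy MDP from above (reusing `P`, `R`, `n_s`, `n_a`, and `gamma`):

```python
# Value iteration: repeatedly apply the Bellman optimality backup.
Q_star = np.zeros((n_s, n_a))
expected_r = np.einsum("sat,sat->sa", P, R)          # E[r | s, a]
for _ in range(10_000):
    V_star = Q_star.max(axis=1)                      # V*(s) = max_a Q*(s, a)
    Q_next = expected_r + gamma * np.einsum("sat,t->sa", P, V_star)
    if np.abs(Q_next - Q_star).max() < 1e-10:        # stop once the backup converges
        break
    Q_star = Q_next

# The optimal policy falls out as a byproduct: act greedily w.r.t. Q*.
pi_star = Q_star.argmax(axis=1)
```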
Computing Value Functions
Tabular case. For small state spaces, $V$ and $Q$ are stored as tables (arrays). $V$ requires $|\mathcal{S}|$ entries; $Q$ requires $|\mathcal{S}| \times |\mathcal{A}|$ entries.
Function approximation. For large or continuous spaces, value functions are approximated:
- Linear: $\hat{V}(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$, where $\phi(s)$ is a feature vector.
- Neural network: $\hat{Q}(s, a; \theta)$, where $\theta$ parameterizes a deep network. DQN uses a CNN that takes a stack of four recent frames as input and outputs $Q(s, a)$ for each of up to 18 Atari actions. (Both forms are sketched after this list.)
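A minimal sketch of both parameterizations. The scalar feature map, the 4-dimensional state, and the 2-action output are invented stand-ins, and the small MLP substitutes for DQN's CNN:

```python
import numpy as np
import torch
import torch.nn as nn

# Linear: the value estimate is a dot product of learned weights with features.
def phi(s):
    return np.array([1.0, s, s * s])   # hypothetical feature map for a scalar state

w = np.zeros(3)                        # learned weight vector
v_hat = w @ phi(0.5)                   # V(s; w) = w^T phi(s)

# Neural: one forward pass emits Q(s, a) for every discrete action at once.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
q_values = q_net(torch.randn(1, 4))    # shape (1, 2): one Q-value per action
```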
Monte Carlo estimation. Estimate $V^\pi(s)$ by averaging observed returns from state $s$ over many episodes:

$$V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s)$$

where $G_i(s)$ is the return observed after the $i$-th visit to $s$.
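A minimal every-visit sketch, assuming episodes are recorded as lists of `(state, reward)` pairs where the reward is the one received after leaving that state:

```python
from collections import defaultdict

def mc_estimate(episodes, gamma=0.99):
    """Every-visit Monte Carlo: average the return observed after each visit."""
    total, visits = defaultdict(float), defaultdict(int)
    for episode in episodes:                # episode: [(state, reward), ...]
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g          # return from this step to episode end
            total[state] += g
            visits[state] += 1
    return {s: total[s] / visits[s] for s in visits}
```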
Temporal-difference (TD) estimation. Update $V(S_t)$ after each step using bootstrapping:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
The term $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
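A one-step tabular TD(0) sketch, assuming `V` is a dict mapping states to current estimates:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the estimate for state s."""
    delta = r + gamma * V[s_next] - V[s]   # the TD error
    V[s] += alpha * delta                  # nudge toward the bootstrapped target
```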
Value Function Geometry
For a finite MDP with $n$ states, the value function is a vector in $\mathbb{R}^n$. The set of all achievable value functions forms a polytope in this space. The optimal value function $V^*$ sits at a vertex of this polytope, and policy improvement moves toward it.
Why It Matters
Value functions are the workhorse of RL. They enable:
- Policy evaluation: Assessing how good a policy is without running it indefinitely.
- Policy improvement: Acting greedily with respect to $Q^\pi$ produces a policy at least as good as $\pi$ (the policy improvement theorem).
- Planning: In model-based RL, value functions guide lookahead search (e.g., AlphaZero uses a learned $V$ to evaluate board positions in MCTS).
- Credit assignment: Value functions propagate information about future rewards backward through time, solving the credit assignment problem.
Key Technical Details
- DQN (Mnih et al., 2015) approximates $Q^*$ with a CNN and uses experience replay (a buffer of the last million transitions) and a target network (updated every 10,000 steps) for stability.
- Double Q-learning (van Hasselt et al., 2016) addresses overestimation bias in Q-learning by decoupling action selection from evaluation.
- Dueling networks (Wang et al., 2016) decompose $Q(s, a) = V(s) + A(s, a)$ architecturally, with a shared representation feeding separate state-value and advantage streams.
- Value function approximation can diverge in the off-policy, function approximation, bootstrapping setting (the "deadly triad" identified by Sutton & Barto).
- For continuous actions, representing $Q(s, a)$ as a table or as one discrete output per action is impossible. Algorithms like DDPG and SAC use a separate network taking both $s$ and $a$ as input (see the sketch below).
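A minimal PyTorch sketch of such a critic; the state and action dimensions are invented:

```python
import torch
import torch.nn as nn

class ContinuousQ(nn.Module):
    """Critic for continuous actions: score the concatenated (s, a) pair."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),               # a single scalar Q(s, a)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

q = ContinuousQ()
q_value = q(torch.randn(1, 8), torch.randn(1, 2))   # shape (1, 1)
```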
Common Misconceptions
"V and Q contain different information." They encode the same information differently. Given the MDP dynamics, and are fully interconvertible. is more directly useful for action selection because you can compare actions without knowing the transition model.
"Higher V(s) means the state is inherently better." depends on the policy . A state might have high value under a good policy and low value under a bad one. Only reflects the intrinsic quality of a state.
"Value functions are always accurate after training." Function approximation introduces systematic errors. Overestimation bias is a well-documented issue in Q-learning with function approximation, motivating techniques like double Q-learning and clipped double Q (used in TD3 and SAC).
"You always need value functions for RL." Pure policy gradient methods (e.g., REINFORCE) learn policies without explicitly maintaining a value function. However, adding a value function baseline dramatically reduces variance, which is why actor-critic methods dominate in practice.
Connections to Other Concepts
- return-and-discount-factor.md -- Value functions are expected returns.
- policies.md -- Value functions evaluate policies; the optimal policy is derived from $Q^*$.
- bellman-equations.md -- The recursive equations that value functions satisfy.
- markov-decision-processes.md -- Value functions are defined within the MDP framework.
- exploration-vs-exploitation.md -- Value estimates guide the exploitation side of the tradeoff.
Further Reading
- Sutton & Barto (2018) -- Reinforcement Learning: An Introduction, Chapters 3-6. Comprehensive treatment from definition through Monte Carlo and TD estimation.
- Mnih et al. (2015) -- "Human-level control through deep reinforcement learning." Nature, 518. DQN: the breakthrough in neural value function approximation.
- van Hasselt et al. (2016) -- "Deep reinforcement learning with double Q-learning." AAAI. Identifies and corrects overestimation bias in DQN.
- Wang et al. (2016) -- "Dueling network architectures for deep reinforcement learning." ICML. Introduces the V + A decomposition for Q-networks.
- Baird (1995) -- "Residual algorithms: Reinforcement learning with function approximation." ICML. Early identification of divergence issues with value function approximation.