One-Line Summary: The agent's decision rule mapping states to actions -- the central object that RL algorithms learn.
Prerequisites: what-is-reinforcement-learning.md, markov-decision-processes.md, states-actions-rewards.md.
What Is a Policy?
A policy is like a playbook. A basketball coach's playbook says: "When the shot clock is under 5 seconds and we're trailing, run play X." It maps situations to decisions. In RL, a policy maps states to actions -- it is the complete specification of how an agent behaves. If you know the policy, you know exactly what the agent will do (or the probability of what it will do) in every possible situation.
The entire goal of reinforcement learning can be stated in one sentence: find the best policy. Every algorithm -- value-based, policy-based, or model-based -- is ultimately a different strategy for searching the space of policies.
How It Works
Formal Definition
A policy is a mapping from states to actions (or distributions over actions):

$$\pi : \mathcal{S} \to \mathcal{A} \quad \text{(deterministic)} \qquad \pi(a \mid s) = P(A_t = a \mid S_t = s) \quad \text{(stochastic)}$$

A stochastic policy assigns a probability to each action in each state, satisfying:

$$\pi(a \mid s) \ge 0 \quad \text{and} \quad \sum_{a \in \mathcal{A}} \pi(a \mid s) = 1 \quad \text{for all } s \in \mathcal{S}.$$
Deterministic vs. Stochastic Policies
Deterministic policies select a single action in each state: $a = \pi(s)$. They are simpler and, in fully observable MDPs, sufficient -- there always exists a deterministic optimal policy (Puterman, 1994).
Stochastic policies assign probabilities to actions: $a \sim \pi(\cdot \mid s)$. They are essential in several scenarios:
- Partial observability (POMDPs): When the agent cannot distinguish between states, a stochastic policy can outperform any deterministic one.
- Multi-agent settings: Mixed strategies (stochastic policies) are necessary for Nash equilibria in competitive games (e.g., rock-paper-scissors has no deterministic equilibrium).
- Exploration: Stochastic policies naturally explore by sampling different actions.
- Policy gradient methods: Algorithms like REINFORCE and PPO optimize stochastic policies, using the gradient of the expected return with respect to policy parameters.
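To make the distinction concrete, here is a minimal sketch in plain NumPy. The state and action counts and the probability table are arbitrary placeholders, not values from any particular environment:

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)

# Deterministic policy: a lookup table giving one action per state.
det_policy = np.array([2, 0, 1, 2])                       # pi(s) = det_policy[s]

# Stochastic policy: a probability distribution over actions for each state.
probs = rng.random((n_states, n_actions))
stoch_policy = probs / probs.sum(axis=1, keepdims=True)   # each row sums to 1

state = 1
a_det = det_policy[state]                                  # always the same action
a_stoch = rng.choice(n_actions, p=stoch_policy[state])     # sampled action
print(a_det, a_stoch)
```

Querying the deterministic policy in the same state always returns the same action; sampling the stochastic policy can return different actions on different calls, which is exactly what makes it useful for exploration.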
Policy Parameterization
In practice, policies are represented by parameterized functions $\pi_\theta$ with parameters $\theta$.
Tabular policies. For small, discrete state-action spaces, the policy is stored as a lookup table with $|\mathcal{S}| \times |\mathcal{A}|$ entries. Each entry stores $\pi(a \mid s)$.
Linear policies. The action preferences are a linear function of state features, converted to probabilities with a softmax:

$$\pi_\theta(a \mid s) = \frac{\exp\!\left(\theta_a^\top \phi(s)\right)}{\sum_{a'} \exp\!\left(\theta_{a'}^\top \phi(s)\right)}$$

where $\phi(s)$ is a feature vector.
Neural network policies. A neural network takes the state as input and outputs either:
- Action probabilities (discrete): $\pi_\theta(a \mid s) = \mathrm{softmax}\!\left(f_\theta(s)\right)_a$
- Distribution parameters (continuous): $\mu_\theta(s)$ and $\sigma_\theta(s)$ for a Gaussian $\pi_\theta(\cdot \mid s) = \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$
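A hedged sketch of the linear-softmax parameterization for a discrete action space. The feature dimension and the random weights are placeholders; a neural-network policy would simply replace the linear preferences `theta @ phi_s` with the network's output:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def linear_softmax_policy(theta, phi_s):
    """Action probabilities from linear preferences theta_a^T phi(s)."""
    return softmax(theta @ phi_s)         # one preference score per action

rng = np.random.default_rng(0)
n_actions, n_features = 3, 5
theta = rng.normal(size=(n_actions, n_features))   # policy parameters
phi_s = rng.normal(size=n_features)                # feature vector phi(s)

pi_s = linear_softmax_policy(theta, phi_s)         # probabilities over actions
action = rng.choice(n_actions, p=pi_s)             # sample an action
print(pi_s, action)
```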
The Optimal Policy
The optimal policy $\pi^*$ achieves the highest expected return from every state simultaneously:

$$V^{\pi^*}(s) \ge V^{\pi}(s) \quad \text{for all } s \in \mathcal{S} \text{ and all policies } \pi,$$

where $V^\pi$ is the state-value function under $\pi$ (see value-functions.md).
A fundamental theorem of MDPs (Puterman, 1994): for any finite MDP, there exists at least one deterministic optimal policy, and it can be derived from the optimal action-value function:

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$

This result is why value-based methods (like Q-learning) work: learn $Q^*$, then act greedily.
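A minimal sketch of this greedy extraction from a learned Q-table. The Q-values below are made up for illustration:

```python
import numpy as np

# Hypothetical learned action-value table Q[s, a] for 3 states and 3 actions.
Q = np.array([
    [1.0, 2.5, 0.3],
    [0.1, 0.0, 4.2],
    [3.3, 3.3, 1.0],   # tie: actions 0 and 1 are both optimal in this state
])

greedy_policy = Q.argmax(axis=1)   # pi*(s) = argmax_a Q*(s, a)
print(greedy_policy)               # -> [1 2 0]; argmax breaks the tie toward action 0
```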
Policy Classes
Greedy policies always choose the action with the highest estimated value:

$$\pi(s) = \arg\max_{a} \hat{Q}(s, a)$$
Epsilon-greedy policies explore with probability $\varepsilon$ and exploit with probability $1 - \varepsilon$ (see exploration-vs-exploitation.md):

$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} \hat{Q}(s, a') \\ \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
Softmax (Boltzmann) policies choose actions proportionally to exponentiated values:

$$\pi(a \mid s) = \frac{\exp\!\left(\hat{Q}(s, a)/\tau\right)}{\sum_{a'} \exp\!\left(\hat{Q}(s, a')/\tau\right)}$$

where $\tau$ is a temperature parameter. As $\tau \to 0$, this converges to the greedy policy; as $\tau \to \infty$, it becomes uniform.
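A sketch of both exploratory policy classes acting on the same estimated Q-values. The epsilon and temperature values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 1.5])              # estimated Q(s, .) for one state

def epsilon_greedy(q, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: uniform random action
    return int(q.argmax())                 # exploit: greedy action

def boltzmann(q, tau=0.5):
    prefs = (q - q.max()) / tau            # shift by max for numerical stability
    p = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q), p=p))    # sample proportionally to exp(Q / tau)

print(epsilon_greedy(q), boltzmann(q))
```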
Behavior Policy vs. Target Policy
In off-policy learning, two policies coexist:
- The behavior policy $b$ generates the data (the actions actually taken).
- The target policy $\pi$ is the policy being evaluated or improved.
Q-learning uses an epsilon-greedy behavior policy while learning the greedy (optimal) target policy. More generally, importance sampling corrects for the mismatch between the two policies: $\rho_t = \dfrac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$.
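A sketch of the per-step importance ratio used to reweight off-policy data. The two probability vectors are hypothetical action distributions in a single state:

```python
import numpy as np

# Hypothetical action probabilities in one state, over 3 actions.
pi = np.array([0.0, 0.9, 0.1])     # near-greedy target policy
b  = np.array([0.1, 0.8, 0.1])     # epsilon-greedy behavior policy

action_taken = 1                   # action the behavior policy actually executed
rho = pi[action_taken] / b[action_taken]   # importance ratio pi(a|s) / b(a|s)
print(rho)                         # > 1: this sample is up-weighted under the target policy
```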
Why It Matters
The policy is the deliverable of an RL system. After training, the policy is deployed -- it is what the robot executes, what the game AI uses, what the recommendation engine applies. Choosing the right policy representation (tabular, linear, neural network) and the right policy class (deterministic, stochastic, parameterized) directly determines the expressiveness, trainability, and deployability of the RL solution.
Key Technical Details
- Policy gradient theorem (Sutton et al., 2000): $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$. This enables gradient-based optimization of stochastic policies.
- A stationary policy does not change over time: $\pi_t = \pi$ is independent of $t$. Optimal policies for infinite-horizon discounted MDPs are always stationary.
- Policy entropy measures exploration. Maximum entropy RL (e.g., SAC) adds an entropy bonus to the objective.
- In continuous action spaces, the policy typically outputs a squashed Gaussian: $u \sim \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$, $a = \tanh(u)$, to bound actions within a valid range (a short sampling sketch follows this list).
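A minimal sketch of sampling from a tanh-squashed Gaussian policy head, in the style used by SAC for continuous control. The mean, standard deviation, and action bounds below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def squashed_gaussian_sample(mu, sigma, low=-2.0, high=2.0):
    """Sample u ~ N(mu, sigma^2), squash with tanh, rescale to [low, high]."""
    u = rng.normal(mu, sigma)                      # unbounded Gaussian sample
    a = np.tanh(u)                                 # squashed into (-1, 1)
    return low + 0.5 * (a + 1.0) * (high - low)    # rescaled to the valid action range

# Placeholder network outputs for one state and a 2-dimensional action.
mu, sigma = np.array([0.3, -1.2]), np.array([0.5, 0.2])
print(squashed_gaussian_sample(mu, sigma))
```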
Common Misconceptions
"A stochastic policy is always suboptimal." In fully observable MDPs, a deterministic optimal policy always exists. But stochastic policies can be optimal in POMDPs and are necessary in competitive multi-agent games. Furthermore, during training, stochastic policies are essential for exploration.
"Policy-based and value-based methods are fundamentally different." They are deeply connected. Value-based methods implicitly define a policy (greedy w.r.t. ). Actor-critic methods explicitly maintain both. The policy gradient theorem itself involves the value function .
"The optimal policy is unique." Multiple policies can be optimal if they achieve the same maximum value in all states but differ in states where multiple actions are equally good (ties in ).
Connections to Other Concepts
- markov-decision-processes.md -- Policies are solutions to MDPs.
- value-functions.md -- Value functions evaluate the quality of a policy.
- bellman-equations.md -- The Bellman equations characterize value functions under a given policy.
- exploration-vs-exploitation.md -- Epsilon-greedy and Boltzmann policies are exploration strategies.
- return-and-discount-factor.md -- The policy maximizes the expected return.
Further Reading
- Sutton et al. (2000) -- "Policy gradient methods for reinforcement learning with function approximation." NeurIPS. Proves the policy gradient theorem, enabling gradient-based policy optimization.
- Puterman (1994) -- Markov Decision Processes. Proves existence of deterministic optimal policies for finite MDPs.
- Schulman et al. (2017) -- "Proximal Policy Optimization algorithms." arXiv. Introduces PPO, the most widely used policy optimization algorithm in practice.
- Haarnoja et al. (2018) -- "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning." ICML. Introduces SAC with maximum-entropy policy optimization.
- Silver et al. (2014) -- "Deterministic policy gradient algorithms." ICML. Extends the policy gradient theorem to deterministic policies, enabling DPG and DDPG.