One-Line Summary: The mathematical foundation that enables direct optimization of parameterized policies via gradient ascent on expected return, bypassing the need to differentiate through unknown environment dynamics.
Prerequisites: Parameterized policies $\pi_\theta(a \mid s)$, expected return $J(\theta)$, basic calculus (chain rule, logarithmic derivatives), Markov Decision Processes, stochastic gradient ascent.
What Is the Policy Gradient Theorem?
Imagine you are a blindfolded chef trying to perfect a recipe. You cannot see the stove, the ingredients, or the chemical reactions happening in the pan -- but you can taste the final dish and remember what actions you took. The Policy Gradient Theorem tells you exactly how to adjust your cooking habits based solely on the outcomes you observe, without needing to understand the physics of cooking.
In reinforcement learning, we want to find a policy that maximizes expected cumulative reward. The naive approach would require differentiating through the environment's transition dynamics $p(s_{t+1} \mid s_t, a_t)$, which are unknown. The Policy Gradient Theorem sidesteps this entirely: it expresses the gradient of expected return purely in terms of quantities the agent can sample -- action probabilities and observed rewards.
How It Works
The Objective
We seek to maximize the expected return $J(\theta)$ under policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} r_t \right]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ is a trajectory sampled under $\pi_\theta$.
Why Differentiation Is Non-Trivial
Expanding $J(\theta)$ reveals the problem. The probability of a trajectory is:

$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$

Differentiating $J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$ requires $\nabla_\theta p_\theta(\tau)$, which involves the unknown dynamics $p(s_{t+1} \mid s_t, a_t)$. We cannot differentiate through what we do not know.
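Spelled out (an added intermediate step, in the notation above), the naive gradient is

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau,$$

and the right-hand side is no longer an expectation under $p_\theta(\tau)$, so it cannot be estimated by simply sampling trajectories and averaging.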
The Log-Probability Trick
The key identity is deceptively simple. For any function $f(x)$:

$$\nabla_\theta\, \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right] = \mathbb{E}_{x \sim p_\theta}\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]$$

This follows from $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$. Substituting into the gradient of $J(\theta)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]$$
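For completeness, the one-line derivation behind this substitution (an added intermediate step): multiplying and dividing by $p_\theta(\tau)$ inside the integral turns the intractable gradient above into an expectation we can sample:

$$\int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right]$$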
The Dynamics Cancel
Taking the log of the trajectory probability:

$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \mid s_t, a_t)$$

When we differentiate with respect to $\theta$, the initial state distribution $\rho_0(s_0)$ and the transition dynamics $p(s_{t+1} \mid s_t, a_t)$ vanish because they do not depend on $\theta$. This yields the Policy Gradient Theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]$$

where $G_t = \sum_{t'=t}^{T} r_{t'}$ is the return from time $t$ onward. (Replacing the full return $R(\tau)$ with the reward-to-go $G_t$ is justified because an action cannot influence rewards received before it was taken.) Each action's log-probability is weighted by the reward that followed it.
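As a concrete illustration, here is a minimal NumPy sketch of this estimator for a tabular softmax policy on a toy environment. The environment (`step`), its reward rule, and all sizes are hypothetical stand-ins for an unknown black-box MDP, not part of the theorem itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 3, 2, 10

def policy_probs(theta, s):
    """Softmax policy over actions in state s; theta has shape (N_STATES, N_ACTIONS)."""
    z = theta[s] - theta[s].max()                 # stabilized logits
    p = np.exp(z)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Score function of a tabular softmax policy: one-hot(a) - pi(.|s), placed in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy_probs(theta, s)
    g[s, a] += 1.0
    return g

def step(s, a):
    """Toy stand-in for unknown dynamics: reward 1 iff the action 'matches' the state."""
    reward = 1.0 if a == s % N_ACTIONS else 0.0
    return rng.integers(N_STATES), reward

def policy_gradient_estimate(theta, n_trajectories=500):
    """Monte Carlo estimate of grad J = E[ sum_t grad log pi(a_t | s_t) * G_t ]."""
    grad = np.zeros_like(theta)
    for _ in range(n_trajectories):
        s = rng.integers(N_STATES)
        states, actions, rewards = [], [], []
        for _ in range(HORIZON):
            a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
            s_next, r = step(s, a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        G = np.cumsum(rewards[::-1])[::-1]        # returns-to-go G_t (undiscounted)
        for s_t, a_t, G_t in zip(states, actions, G):
            grad += grad_log_pi(theta, s_t, a_t) * G_t
    return grad / n_trajectories

theta = np.zeros((N_STATES, N_ACTIONS))
for _ in range(50):                               # plain stochastic gradient ascent on J
    theta += 0.01 * policy_gradient_estimate(theta)
```

Note that the environment's `step` function is only ever sampled, never differentiated; after a few iterations `policy_probs(theta, s)` concentrates on the rewarded action in each state.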
The Score Function Estimator
The term $\nabla_\theta \log \pi_\theta(a \mid s)$ is called the score function. It indicates the direction in parameter space that increases the probability of action $a$ in state $s$. The theorem says: move in this direction, but scale the step by how good the outcome was. Good outcomes reinforce the actions that led to them; bad outcomes suppress them.
Why It Matters
The Policy Gradient Theorem is the theoretical bedrock upon which every policy gradient algorithm rests -- from REINFORCE to PPO to the RLHF systems training modern large language models. Without it, we would need a differentiable model of the environment, which is unavailable in most real-world problems (robotics, game playing, language model alignment). It converts an intractable optimization problem into a tractable one solvable with Monte Carlo sampling.
Key Technical Details
- The theorem holds for both episodic (finite horizon) and continuing (infinite horizon with discounting or average reward) settings.
- The score function has expectation zero: $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \right] = 0$. This allows subtracting any state-dependent baseline $b(s)$ without introducing bias (see the numerical check after this list).
- For Gaussian policies $\pi_\theta(a \mid s) = \mathcal{N}\!\left(a;\, \mu_\theta(s), \sigma^2\right)$, the score function with respect to the mean parameters is $\nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2}\, \nabla_\theta \mu_\theta(s)$.
- For softmax (categorical) policies $\pi_\theta(a \mid s) \propto \exp\!\left(\theta^\top \phi(s, a)\right)$, the score function is $\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s)}\left[ \phi(s, a') \right]$, where $\phi(s, a)$ are features.
- The raw Monte Carlo estimator of the policy gradient has notoriously high variance, motivating baselines (advantage-estimation.md) and critic networks (actor-critic-methods.md).
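As a quick numerical check of the zero-expectation property above (a small sketch with an arbitrary example distribution, using the same tabular softmax score as in the estimator sketch earlier):

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.1])        # an arbitrary softmax policy pi(. | s)

# Score w.r.t. the logits for each action a: one-hot(a) - probs (row a of the matrix).
scores = np.eye(len(probs)) - probs

# E_{a ~ pi}[ score(a) ] = sum_a pi(a) * (one-hot(a) - probs) = probs - probs = 0,
# so subtracting any state-dependent baseline b(s) leaves the gradient unbiased.
print(probs @ scores)                     # -> [0. 0. 0.] up to floating-point error
```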
Common Misconceptions
- "The policy gradient theorem requires a differentiable environment." False -- this is precisely what the theorem avoids. The environment is treated as a black box; only the policy must be differentiable with respect to .
- "Policy gradients only work for discrete action spaces." The theorem applies to continuous actions equally well. Gaussian policies are the standard choice for continuous control.
- "The log-probability trick is just a mathematical convenience." It is fundamental. Without it, the gradient estimator would require knowledge of , making model-free policy optimization impossible.
- "Higher returns always mean larger gradient updates." The gradient also depends on the score function magnitude. Actions the policy already takes with high confidence produce smaller score function values, creating a natural dampening effect.
Connections to Other Concepts
- reinforce.md -- The simplest algorithm built directly on the policy gradient theorem using Monte Carlo returns.
- actor-critic-methods.md -- Replaces Monte Carlo returns with learned value function estimates to reduce variance.
- advantage-estimation.md -- Provides sophisticated estimators for the $G_t$ term in the gradient to control the bias-variance trade-off.
- trust-region-methods.md -- Addresses the fact that the theorem gives a local gradient direction but says nothing about safe step sizes.
- proximal-policy-optimization.md -- The practical descendant that clips the gradient-based update for stability.
Further Reading
- Sutton et al. (1999), "Policy Gradient Methods for Reinforcement Learning with Function Approximation" -- The foundational paper that formally proves the policy gradient theorem for function approximation settings and establishes its compatibility with temporal-difference learning.
- Williams (1992), "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" -- Introduces REINFORCE and the log-probability trick that underlies the score function estimator.
- Peters & Schaal (2008), "Reinforcement Learning of Motor Skills with Policy Gradients" -- An excellent survey connecting policy gradient theory to robotics applications with clear derivations.
- Schulman (2016), PhD Thesis, "Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs" -- Generalizes the policy gradient theorem to stochastic computation graphs, providing deep theoretical insight.