One-Line Summary: The simplest policy gradient algorithm -- sample a complete trajectory, weight each action's log-probability by the return that followed it, and update the policy in the direction that reinforces successful behavior.
Prerequisites: Policy Gradient Theorem (policy-gradient-theorem.md), Monte Carlo estimation, stochastic gradient ascent, parameterized policies $\pi_\theta(a \mid s)$.
What Is REINFORCE?
Imagine a stand-up comedian trying new material. She performs an entire set (a complete episode), notes which jokes got laughs and which fell flat, then adjusts her routine accordingly. Jokes that produced big laughs get repeated more often; jokes that bombed get dropped. She has to finish the whole set before she can assess anything -- she cannot update mid-performance. This is REINFORCE: a complete-episode, Monte Carlo policy gradient method.
REINFORCE, introduced by Ronald Williams in 1992, is the most direct instantiation of the policy gradient theorem. It collects a full trajectory, computes the return from each time step, and updates the policy parameters to increase the probability of actions that led to high returns.
How It Works
The Algorithm
- Initialize policy parameters $\theta$.
- Repeat:
  - Sample a complete trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, \ldots, r_T)$ by following $\pi_\theta$.
  - For each time step $t$, compute the return: $G_t = \sum_{k=t}^{T-1} \gamma^{k-t}\, r_{k+1}$.
  - Update parameters: $\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
That is it. No value function, no bootstrapping, no replay buffer. Pure Monte Carlo policy gradient.
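To make the loop concrete, here is a minimal sketch of a single REINFORCE update, assuming a linear-softmax policy over discrete actions and a trajectory already collected as `(state_features, action, reward)` tuples (the environment-interaction code is omitted; names and learning rates are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, trajectory, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a single complete trajectory.

    theta:      (n_features, n_actions) weights of a linear-softmax policy.
    trajectory: list of (state_features, action, reward) tuples.
    """
    T = len(trajectory)
    # Compute returns backwards in O(T): G_t = r_{t+1} + gamma * G_{t+1}.
    G = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = trajectory[t][2] + gamma * running
        G[t] = running
    # Accumulate sum_t G_t * grad log pi(a_t | s_t); the gamma^t weighting
    # of each term is omitted here, as is common in practice.
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(trajectory):
        probs = softmax(s @ theta)
        one_hot = np.zeros(theta.shape[1])
        one_hot[a] = 1.0
        # For a linear-softmax policy, grad log pi(a|s) = outer(s, e_a - pi).
        grad += G[t] * np.outer(s, one_hot - probs)
    return theta + alpha * grad
```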
The High Variance Problem
REINFORCE suffers from extreme variance. Consider why: the return $G_t$ is a single random sample of the expected future reward. In a stochastic environment, two identical actions in the same state can produce wildly different returns due to randomness in transitions and future actions. This noise propagates directly into the gradient estimate.
Variance scales with trajectory length, reward magnitude, and stochasticity of the environment. In practice, raw REINFORCE can require millions of episodes to converge on even simple tasks.
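A toy simulation (not REINFORCE itself; it assumes undiscounted i.i.d. unit-variance rewards) makes the horizon dependence concrete: the variance of a Monte Carlo return grows linearly with episode length, so longer episodes mean proportionally noisier gradient estimates at a fixed trajectory budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical variance of the return G = sum of T noisy per-step rewards.
for T in (10, 100, 1000):
    returns = rng.normal(loc=1.0, scale=1.0, size=(10_000, T)).sum(axis=1)
    print(f"T={T:5d}  Var[G] ~ {returns.var():7.1f}")   # roughly equal to T
```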
Baseline Subtraction
The most important variance-reduction technique is subtracting a state-dependent baseline $b(s_t)$ from the return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \bigl(G_t - b(s_t)\bigr)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
Crucially, any baseline that depends only on the state (not the action) leaves the gradient unbiased. This follows from:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\bigl[b(s)\, \nabla_\theta \log \pi_\theta(a \mid s)\bigr] = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$
The optimal baseline in the minimum-variance sense is an average of returns weighted by the squared norm of the score function, but in practice the state-value function $\hat V(s_t)$ is used. This transforms the raw return into an advantage-like quantity $G_t - \hat V(s_t)$: how much better was this trajectory than expected?
REINFORCE with Baseline (Pseudocode)
- Initialize policy parameters $\theta$ and value parameters $w$.
- Repeat:
  - Sample trajectory $\tau$ under $\pi_\theta$.
  - Compute returns $G_t$ for each $t$.
  - Update value function: minimize $\sum_t \bigl(G_t - \hat V_w(s_t)\bigr)^2$ with respect to $w$.
  - Update policy: $\theta \leftarrow \theta + \alpha \sum_t \bigl(G_t - \hat V_w(s_t)\bigr)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
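A sketch of this loop's update step, continuing the linear-softmax setting from the earlier snippet; the linear value function $\hat V_w(s) = s^\top w$ and the two learning rates are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_with_baseline(theta, w, trajectory,
                            alpha_pi=0.01, alpha_v=0.05, gamma=0.99):
    """One joint update of policy weights theta and value weights w
    from one trajectory of (state_features, action, reward) tuples."""
    T = len(trajectory)
    G = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = trajectory[t][2] + gamma * running
        G[t] = running
    grad_theta = np.zeros_like(theta)
    grad_w = np.zeros_like(w)
    for t, (s, a, _) in enumerate(trajectory):
        delta = G[t] - s @ w              # advantage-like term G_t - V_w(s_t)
        probs = softmax(s @ theta)
        one_hot = np.zeros(theta.shape[1])
        one_hot[a] = 1.0
        grad_theta += delta * np.outer(s, one_hot - probs)
        grad_w += delta * s               # ascent on -(G_t - V_w(s_t))^2 / 2
    return theta + alpha_pi * grad_theta, w + alpha_v * grad_w
```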
Why It Matters
REINFORCE is the conceptual starting point for all policy gradient methods. Understanding its strengths (simplicity, unbiasedness, generality) and weaknesses (high variance, sample inefficiency) motivates every subsequent algorithm in the policy gradient lineage. Actor-critic methods reduce its variance by bootstrapping. PPO constrains its update step for stability. But the core mechanism -- reinforcing good actions via the score function -- remains unchanged throughout the entire family.
Key Technical Details
- REINFORCE is on-policy: each trajectory is used for exactly one gradient update, then discarded. This is highly sample-inefficient.
- The $\gamma^t$ factor in front of $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is often omitted in practice (using undiscounted weighting of the gradient), which changes the objective but can improve empirical performance.
- Learning rates are task- and parameterization-dependent and typically must be kept small, since the high-variance gradient estimate otherwise destabilizes training.
- Batch REINFORCE collects $N$ trajectories before computing a single gradient update, reducing gradient variance by a factor of $N$ but requiring $N$ times more samples per update.
- Reward normalization (subtracting the mean and dividing by the standard deviation of returns across a batch) is a practical heuristic that acts as an adaptive baseline; a sketch follows this list.
- For discrete action spaces, $\pi_\theta(a \mid s)$ is typically a softmax over logits. For continuous actions, it is commonly a Gaussian $\mathcal{N}\bigl(\mu_\theta(s), \sigma_\theta(s)^2\bigr)$.
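As a sketch of the last two points above (the normalization constant `eps` and the diagonal-Gaussian parameterization are illustrative assumptions):

```python
import numpy as np

def normalize_returns(batch_returns, eps=1e-8):
    """Batch-level return normalization. Subtracting the mean acts as a
    constant (adaptive) baseline; dividing by the std rescales the
    gradient, effectively adapting the learning rate."""
    flat = np.concatenate(batch_returns)   # batch_returns: list of G arrays
    mean, std = flat.mean(), flat.std()
    return [(g - mean) / (std + eps) for g in batch_returns]

def gaussian_log_prob(a, mu, sigma):
    """log pi(a|s) for a diagonal Gaussian policy N(mu(s), diag(sigma(s)^2)),
    a common parameterization for continuous actions."""
    return -0.5 * np.sum(((a - mu) / sigma) ** 2
                         + 2.0 * np.log(sigma)
                         + np.log(2.0 * np.pi))
```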
Common Misconceptions
- "REINFORCE with a baseline becomes actor-critic." Not quite. REINFORCE with a baseline still uses Monte Carlo returns . Actor-critic methods replace with bootstrapped estimates (e.g., TD targets), introducing bias but substantially reducing variance. The distinction is Monte Carlo vs. bootstrapping.
- "The baseline must be the value function." Any state-dependent function works. A constant baseline (e.g., the running average of returns) already helps significantly. The value function is simply the most effective common choice.
- "REINFORCE cannot work in practice." It can and does, especially for short-horizon problems or when combined with variance reduction. Williams' original experiments and many modern hyperparameter search methods use REINFORCE-style updates.
- "More trajectories per batch always helps." There are diminishing returns. Variance decreases as , so doubling the batch from 1000 to 2000 trajectories only reduces standard deviation by about 30%.
Connections to Other Concepts
- policy-gradient-theorem.md -- The theoretical result that REINFORCE directly implements.
- actor-critic-methods.md -- The natural evolution of REINFORCE, introducing a learned critic to reduce variance via bootstrapping.
- advantage-estimation.md -- Generalizes the baseline subtraction idea into the full GAE framework.
- entropy-regularization.md -- Often added to the REINFORCE objective to prevent premature convergence in the absence of a critic.
Further Reading
- Williams (1992), "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" -- The original paper introducing REINFORCE. Remarkably clear and readable, it establishes the log-probability trick and baseline subtraction as the core machinery of policy gradient methods.
- Sutton & Barto (2018), "Reinforcement Learning: An Introduction," Chapter 13 -- The textbook treatment of REINFORCE with detailed derivations and pedagogical examples, including the short-corridor gridworld that illustrates why value-based methods struggle where policy gradients succeed.
- Greensmith, Bartlett, & Baxter (2004), "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning" -- A thorough analysis of baseline design and variance reduction for policy gradient estimators.