One-Line Summary: A two-network architecture that combines a policy (the actor) with a learned value function (the critic) to reduce the high variance of pure policy gradient methods, at the cost of a small amount of bias from bootstrapping.
Prerequisites: REINFORCE (reinforce.md), Policy Gradient Theorem (policy-gradient-theorem.md), Temporal Difference learning, value function approximation, bias-variance trade-off.
What Is Actor-Critic?
Imagine a figure skater and her coach. The skater (the actor) performs routines and makes real-time decisions about jumps, spins, and footwork. The coach (the critic) watches each move and provides immediate feedback: "that triple axel was above your average -- do more of that" or "that spin was below your usual standard -- adjust your technique." The skater does not have to wait until the end of the entire program to learn. She improves move by move, guided by the coach's running evaluation.
In RL terms, the actor is a parameterized policy $\pi_\theta(a \mid s)$ that selects actions. The critic is a learned value function $V_w(s)$ (or $Q_w(s, a)$) that evaluates how good the current situation is. The critic's feedback replaces the noisy Monte Carlo returns used in REINFORCE with lower-variance, bootstrapped estimates.
How It Works
The Core Idea: Bootstrapped Policy Gradients
REINFORCE uses the full Monte Carlo return $G_t$ to weight the policy gradient. Actor-critic replaces this with a bootstrapped target based on the TD error:

$$\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$$

The TD error is an unbiased estimate of the advantage $A^\pi(s_t, a_t)$ when $V_w = V^\pi$ (the true value function). The policy gradient becomes:

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
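As a concrete illustration with made-up numbers: if $r_{t+1} = 1$, $\gamma = 0.99$, $V_w(s_{t+1}) = 2.0$, and $V_w(s_t) = 2.5$, then $\delta_t = 1 + 0.99 \times 2.0 - 2.5 = 0.48$. The positive TD error nudges the policy to make $a_t$ more likely in $s_t$; a negative value would push the other way.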
One-Step Actor-Critic Algorithm
- Initialize actor parameters $\theta$ and critic parameters $w$.
- Observe initial state $s_0$.
- For each time step $t$:
  - Sample action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
  - Execute $a_t$, observe reward $r_{t+1}$ and next state $s_{t+1}$.
  - Compute TD error: $\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$.
  - Update critic: $w \leftarrow w + \alpha_w \, \delta_t \, \nabla_w V_w(s_t)$.
  - Update actor: $\theta \leftarrow \theta + \alpha_\theta \, \delta_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
This is an online algorithm: it updates after every single transition, not after complete episodes.
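The following is a minimal NumPy sketch of these updates, assuming a softmax policy and a linear value function over a hypothetical feature map `phi(s)`; the environment interface (`env.reset()`, `env.step(a)`) is a placeholder, not a specific library, and the learning rates are illustrative.

```python
import numpy as np

GAMMA, ALPHA_ACTOR, ALPHA_CRITIC = 0.99, 0.01, 0.05  # illustrative values

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_actor_critic(env, phi, n_features, n_actions, steps=10_000):
    """Online one-step actor-critic with linear function approximation.

    `env` is a placeholder with reset() -> state and step(a) -> (state, reward, done);
    `phi(s)` maps a state to a feature vector of length `n_features`.
    """
    theta = np.zeros((n_actions, n_features))  # actor: softmax over linear scores
    w = np.zeros(n_features)                   # critic: V_w(s) = w . phi(s)

    s = env.reset()
    for _ in range(steps):
        x = phi(s)
        probs = softmax(theta @ x)
        a = np.random.choice(n_actions, p=probs)

        s_next, r, done = env.step(a)
        x_next = phi(s_next)

        # TD error: bootstrapped advantage estimate (no bootstrap if episode ended)
        v, v_next = w @ x, (0.0 if done else w @ x_next)
        delta = r + GAMMA * v_next - v

        # Critic update: semi-gradient TD(0)
        w += ALPHA_CRITIC * delta * x

        # Actor update: grad log pi for a softmax-linear policy
        grad_log_pi = -np.outer(probs, x)
        grad_log_pi[a] += x
        theta += ALPHA_ACTOR * delta * grad_log_pi

        s = env.reset() if done else s_next
    return theta, w
```

Note that both updates happen inside the loop, after every single transition, which is exactly what distinguishes this from the end-of-episode updates of REINFORCE.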
The Bias-Variance Trade-off vs. REINFORCE
| Property | REINFORCE | Actor-Critic |
|---|---|---|
| Return estimate | Monte Carlo (unbiased, high variance) | Bootstrapped (biased, lower variance) |
| Update frequency | End of episode | Every time step |
| Sample efficiency | Low (must wait for complete episodes) | Higher (learns from each transition) |
| Requires critic | No (optional baseline) | Yes |
| Bias source | None | Imperfect value function |
The critic introduces bias because $V_w$ is an approximation. Early in training, when $V_w$ is poor, the bias can be substantial. However, the variance reduction is so significant that actor-critic methods almost always converge faster in practice.
N-Step Actor-Critic
The one-step TD error uses a single reward before bootstrapping. We can interpolate between pure TD (one step) and pure Monte Carlo (all steps) using n-step returns:

$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_w(s_{t+n})$$

The corresponding advantage estimate is $G_t^{(n)} - V_w(s_t)$. Larger $n$ reduces bias (more real rewards, less reliance on the imperfect critic) but increases variance (more randomness in the sampled trajectory). This idea is generalized fully by GAE in advantage-estimation.md.
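A short sketch of how such n-step advantage estimates might be computed from a recorded trajectory segment, assuming precomputed arrays of rewards and critic values (hypothetical names) and ignoring episode boundaries for simplicity:

```python
import numpy as np

def n_step_advantages(rewards, values, gamma=0.99, n=5):
    """n-step advantage estimates for one trajectory segment.

    rewards: array of length T with r_{t+1} for t = 0..T-1
    values:  array of length T+1 with V_w(s_t) for t = 0..T (last entry bootstraps)
    Returns an array of length T with G_t^{(n)} - V_w(s_t).
    """
    T = len(rewards)
    advantages = np.empty(T)
    for t in range(T):
        horizon = min(n, T - t)  # truncate the lookahead at the end of the segment
        g = sum(gamma**k * rewards[t + k] for k in range(horizon))
        g += gamma**horizon * values[t + horizon]  # bootstrap with the critic
        advantages[t] = g - values[t]
    return advantages
```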
Shared vs. Separate Networks
A practical design choice is whether the actor and critic share neural network layers. Shared representations can improve data efficiency (both learn useful features), but can also create harmful gradient interference -- the critic's loss landscape may push shared features in directions that harm the actor. Common architectures use a shared trunk with separate heads for policy logits and value predictions.
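A minimal PyTorch sketch of the shared-trunk design; the layer sizes and activations here are arbitrary placeholders, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class ActorCriticNet(nn.Module):
    """Shared trunk with separate policy and value heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(                      # shared feature extractor
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)
```

Using two entirely separate networks simply means giving each head its own trunk, trading extra parameters for freedom from gradient interference.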
Why It Matters
Actor-critic methods are the backbone of nearly all modern policy gradient algorithms. A3C, A2C, PPO, SAC, and IMPALA are all actor-critic methods at their core. The actor-critic paradigm resolved the central practical limitation of REINFORCE (high variance) while preserving the ability to optimize stochastic policies directly. Without this architecture, policy gradient methods would be too sample-inefficient for complex tasks like robotic manipulation, game playing, or language model alignment.
Key Technical Details
- Separate learning rates for the actor ($\alpha_\theta$) and critic ($\alpha_w$) are standard. The critic typically uses a higher learning rate than the actor, because a good critic accelerates actor learning.
- The critic loss is typically MSE: $L(w) = \big(r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\big)^2 = \delta_t^2$, sometimes clipped to prevent large updates (see the sketch after this list).
- Entropy regularization (entropy-regularization.md) is almost always added to the actor's objective in practice.
- The critic can estimate $V(s)$, $Q(s, a)$, or the advantage $A(s, a)$. V-function critics are most common because they require no action input, simplifying the architecture.
- Gradient clipping (e.g., max norm of 0.5) is standard to prevent destabilizing updates from rare, high-magnitude TD errors.
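As referenced above, a hedged sketch of how these pieces typically come together in an A2C-style batched update; the coefficient values are illustrative defaults, and `model` is assumed to be a shared-trunk network like the one sketched earlier:

```python
import torch
import torch.nn.functional as F

def actor_critic_update(model, optimizer, obs, actions, advantages, value_targets,
                        value_coef=0.5, entropy_coef=0.01, max_grad_norm=0.5):
    """One batched update combining policy loss, critic MSE, and an entropy bonus."""
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)

    policy_loss = -(dist.log_prob(actions) * advantages.detach()).mean()
    value_loss = F.mse_loss(values, value_targets)   # critic regression to targets
    entropy = dist.entropy().mean()                  # bonus that maintains exploration

    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # stabilize updates
    optimizer.step()
    return loss.item()
```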
Common Misconceptions
- "Actor-critic is a specific algorithm." It is an architecture and design pattern, not a single algorithm. A2C, A3C, PPO, SAC, and TD3 are all actor-critic algorithms with very different update rules.
- "The critic's only purpose is to reduce variance." The critic also enables online learning (no need to wait for episode termination), handles continuing (non-episodic) tasks naturally, and provides the advantage estimates needed by advanced methods like PPO.
- "Bias from bootstrapping is always harmful." In practice, the bias from a reasonably trained critic is small and decreasing over time, while the variance reduction is immediate and large. The net effect is strongly positive.
- "Shared actor-critic networks are always better." Shared networks can suffer from conflicting gradient signals. Many high-performance implementations (e.g., OpenAI's PPO for RLHF) use separate networks.
Connections to Other Concepts
- reinforce.md -- The pure Monte Carlo predecessor. Actor-critic can be viewed as REINFORCE with bootstrapped returns replacing Monte Carlo returns.
- advantage-estimation.md -- GAE provides the sophisticated advantage estimates used in modern actor-critic implementations.
- a2c-and-a3c.md -- Specific parallel actor-critic architectures that scale the paradigm to multiple workers.
- proximal-policy-optimization.md -- The most widely used actor-critic algorithm, adding clipped surrogate objectives for stability.
- entropy-regularization.md -- A critical addition to the actor's loss to maintain exploration in actor-critic training.
Further Reading
- Konda & Tsitsiklis (2003), "On Actor-Critic Algorithms" -- A rigorous convergence analysis of two-timescale actor-critic methods, proving convergence under appropriate learning rate schedules.
- Sutton & Barto (2018), "Reinforcement Learning: An Introduction," Chapter 13.5 -- Textbook treatment of actor-critic methods with clear derivations and the episodic semi-gradient one-step actor-critic algorithm.
- Grondman et al. (2012), "A Survey of Actor-Critic Reinforcement Learning" -- Comprehensive overview of actor-critic variants, covering natural gradients, compatible function approximation, and continuous action spaces.
- Mnih et al. (2016), "Asynchronous Methods for Deep Reinforcement Learning" -- Introduces A3C, the deep actor-critic method that demonstrated the architecture's scalability.