One-Line Summary: Adding a policy entropy bonus to the optimization objective to encourage exploration, prevent premature convergence to deterministic policies, and improve robustness -- a simple technique with deep connections to maximum entropy RL.

Prerequisites: Policy gradient methods (policy-gradient-theorem.md), actor-critic methods (actor-critic-methods.md), information-theoretic entropy, softmax policies, exploration-exploitation trade-off.

What Is Entropy Regularization?

Imagine a chess player who discovers that a particular opening works well and begins playing it every single game. She stops exploring other openings, never discovers better strategies, and becomes predictable to opponents. A wise coach might say: "Each game, I will pay you a small bonus based on how unpredictable your choice of opening is." This bonus does not care what alternative she picks -- it just rewards variety itself.

Entropy regularization adds exactly this kind of bonus to the RL objective. The entropy of a policy measures how "spread out" or uncertain the action distribution is. A deterministic policy (always choosing one action) has zero entropy. A uniform random policy (equal probability for all actions) has maximum entropy. By adding an entropy bonus to the objective, we reward the policy for maintaining uncertainty, counteracting the natural tendency of policy gradient methods to collapse toward deterministic behavior.
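
As a quick sanity check on these extremes, here is a small NumPy sketch (the probability vectors are invented for illustration) computing the entropy of a deterministic, a uniform, and an in-between distribution over four actions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic policy     -> 0.0
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 actions   -> log(4) ~ 1.386
print(entropy([0.7, 0.1, 0.1, 0.1]))      # committed but exploring  -> ~0.94
```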

How It Works

The Entropy Bonus

For a discrete policy $\pi_\theta$, the entropy at a state $s$ is:

$$H(\pi_\theta(\cdot \mid s)) = -\sum_{a} \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)$$

For a continuous Gaussian policy $\pi_\theta(a \mid s) = \mathcal{N}\!\left(\mu_\theta(s), \sigma_\theta(s)^2\right)$:

$$H(\pi_\theta(\cdot \mid s)) = \tfrac{1}{2} \log\!\left(2 \pi e \, \sigma_\theta(s)^2\right)$$

The entropy-regularized objective becomes:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_t r(s_t, a_t)\right] + \alpha \, \mathbb{E}_{\pi_\theta}\!\left[\sum_t H(\pi_\theta(\cdot \mid s_t))\right]$$

where $\alpha$ is the entropy coefficient (also called the temperature parameter). The policy gradient with entropy regularization is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s) \, A(s, a) + \alpha \, \nabla_\theta H(\pi_\theta(\cdot \mid s))\right]$$

In practice, this is implemented by adding $-\alpha \, H(\pi_\theta(\cdot \mid s))$ to the loss (with the negative sign because we minimize losses but want to maximize entropy).
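
In code, this is one extra term in the actor loss. The following PyTorch-style sketch is illustrative rather than any specific library's implementation; `logits`, `actions`, and `advantages` are assumed to come from the surrounding rollout code:

```python
import torch
from torch.distributions import Categorical

def actor_loss(logits, actions, advantages, alpha=0.01):
    """Policy-gradient loss with an entropy bonus.

    logits:     (batch, n_actions) raw policy network outputs
    actions:    (batch,) sampled actions
    advantages: (batch,) advantage estimates (treated as constants)
    alpha:      entropy coefficient
    """
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    pg_loss = -(log_probs * advantages.detach()).mean()  # maximize expected advantage
    entropy = dist.entropy().mean()                       # mean H(pi(.|s)) over the batch
    return pg_loss - alpha * entropy                      # subtract: more entropy -> lower loss

# Toy usage with random data, just to show the shapes involved
logits = torch.randn(32, 4, requires_grad=True)
actions = torch.randint(0, 4, (32,))
advantages = torch.randn(32)
actor_loss(logits, actions, advantages).backward()
```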

Why Policies Collapse Without It

Policy gradient methods suffer from a positive feedback loop. Suppose action $a_1$ happens to receive a slightly higher return than action $a_2$ in some state $s$. The gradient increases $\pi_\theta(a_1 \mid s)$ and decreases $\pi_\theta(a_2 \mid s)$. Now $a_1$ is sampled more often, generating more data to reinforce it further, while $a_2$ is tried less and may never get a fair evaluation. Without intervention, the policy converges to a near-deterministic mode even if better actions exist. Entropy regularization breaks this cycle by continuously penalizing certainty.
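
This feedback loop is easy to observe in a toy experiment. The sketch below runs REINFORCE (no baseline) on a hypothetical two-armed bandit, with and without an entropy bonus; every number here (means, noise scale, learning rate, coefficient) is invented for illustration, and the exact counts printed depend on the random seeds:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_grad(pi):
    # Gradient of H(pi) with respect to softmax logits: dH/dtheta_k = -pi_k (log pi_k + H)
    H = -(pi * np.log(pi)).sum()
    return -pi * (np.log(pi) + H)

def run_bandit(alpha, seed, steps=3000, lr=0.2):
    # Arm 1 is slightly better on average, but rewards are noisy enough
    # that arm 0 can look better early in training.
    rng = np.random.default_rng(seed)
    means = np.array([1.0, 1.1])
    theta = np.zeros(2)                      # policy logits
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        r = means[a] + rng.normal(scale=1.0)
        grad_logp = (np.eye(2)[a] - pi) * r  # REINFORCE gradient, no baseline
        theta += lr * (grad_logp + alpha * entropy_grad(pi))
    return softmax(theta)

for alpha in (0.0, 0.05):
    finals = np.array([run_bandit(alpha, seed)[1] for seed in range(50)])
    print(f"alpha={alpha}: mean P(better arm) = {finals.mean():.2f}, "
          f"runs stuck on the worse arm = {(finals < 0.1).sum()}/50")
```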

The Entropy Coefficient

The coefficient $\alpha$ determines the strength of the exploration incentive:

  • Too small ($\alpha \to 0$): No exploration benefit. Policy collapses to deterministic behavior. May get trapped in local optima.
  • Too large ($\alpha \to \infty$): Policy stays near-uniform, ignoring rewards. The agent explores randomly and never exploits good strategies.
  • Well-tuned (intermediate $\alpha$): Policy remains stochastic enough to explore while still concentrating probability on good actions.

Standard values range from $0.001$ to $0.1$, with $\alpha = 0.01$ being the most common default (as used in A2C/A3C and PPO).

Automatic Entropy Tuning

Manually tuning $\alpha$ is difficult because the appropriate amount of entropy changes during training (more early, less later) and varies across tasks. Haarnoja et al. (2018) introduced automatic entropy adjustment in SAC by solving a dual optimization problem: $\alpha$ is treated as a learnable parameter and updated to minimize

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\!\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{H}\right]$$

where $\bar{H}$ is a target entropy (typically set to $-\dim(\mathcal{A})$, the negative of the action dimensionality, for continuous actions). This adjusts $\alpha$ to maintain a desired level of entropy throughout training, removing a sensitive hyperparameter.
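
A minimal sketch of this temperature update, in the style of common SAC implementations (names such as `log_alpha`, `log_probs`, the learning rate, and the action dimensionality are assumptions about the surrounding training loop):

```python
import torch

# Parameterize the temperature through log(alpha) so it stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

action_dim = 6                       # e.g., a 6-dimensional continuous action space
target_entropy = -float(action_dim)  # common heuristic: -dim(A)

def update_alpha(log_probs):
    """log_probs: log pi(a|s) for actions freshly sampled from the current policy."""
    alpha = log_alpha.exp()
    # J(alpha) = E[-alpha * log pi(a|s) - alpha * target_entropy]
    alpha_loss = -(alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()

# Toy call: a mean log-prob of -2 means the entropy estimate (2) is above the
# target (-6), so the update drives alpha down.
print(update_alpha(torch.full((32,), -2.0)))
```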

Connection to Maximum Entropy RL

Entropy regularization is the bridge to maximum entropy RL, a principled framework where the agent maximizes the entropy-augmented return:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[r(s_t, a_t) + \alpha \, H(\pi(\cdot \mid s_t))\right]$$

This framework leads to soft versions of the Bellman equations:

$$Q_{\text{soft}}(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s'}\!\left[V_{\text{soft}}(s')\right], \qquad V_{\text{soft}}(s) = \alpha \log \sum_{a} \exp\!\left(\frac{Q_{\text{soft}}(s, a)}{\alpha}\right)$$

The optimal policy under this framework is the Boltzmann policy:

$$\pi^*(a \mid s) = \frac{\exp\!\left(Q_{\text{soft}}(s, a) / \alpha\right)}{\sum_{a'} \exp\!\left(Q_{\text{soft}}(s, a') / \alpha\right)}$$

This is the theoretical basis of Soft Actor-Critic (SAC), which achieves state-of-the-art performance in continuous control by fully embracing the maximum entropy principle.
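
To make the soft quantities concrete, here is a small NumPy sketch (the Q-values are invented) that computes $V_{\text{soft}}$ and the Boltzmann policy for a few temperatures. Note how the policy sweeps from near-uniform to near-greedy as $\alpha$ shrinks, which previews the temperature interpretation below:

```python
import numpy as np

def soft_value_and_policy(q, alpha):
    """V_soft(s) = alpha * log sum_a exp(Q(s,a)/alpha); pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    z = q / alpha
    z_shift = z - z.max()                  # subtract the max for numerical stability
    pi = np.exp(z_shift) / np.exp(z_shift).sum()
    v = alpha * (np.log(np.exp(z_shift).sum()) + z.max())
    return v, pi

q = np.array([1.0, 0.9, 0.2])              # made-up Q-values for three actions
for alpha in (10.0, 1.0, 0.1, 0.01):
    v, pi = soft_value_and_policy(q, alpha)
    print(f"alpha={alpha:>5}: V_soft={v:.3f}, pi={np.round(pi, 3)}")
```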

Temperature Parameter Interpretation

The entropy coefficient $\alpha$ is often called the temperature by analogy with statistical mechanics. At high temperature ($\alpha \to \infty$), all actions are equally likely (like molecules moving randomly at high energy). At low temperature ($\alpha \to 0$), only the highest-value action is selected (like molecules freezing into a crystal). The temperature continuously interpolates between exploration and exploitation.

Why It Matters

Entropy regularization is present in virtually every modern policy gradient implementation. In A2C, A3C, and PPO, it appears as the entropy bonus term $\alpha \, H(\pi_\theta(\cdot \mid s))$ in the objective (subtracted from the loss). In SAC, it is the defining architectural principle. Without entropy regularization, policy gradient methods routinely converge to suboptimal deterministic policies, especially in environments with many local optima or deceptive reward signals. In the RLHF context, entropy regularization (alongside the KL penalty to the reference model) prevents language models from collapsing to repetitive, degenerate outputs during fine-tuning.

Key Technical Details

  • Default entropy coefficient: $\alpha = 0.01$ for A2C/A3C/PPO with discrete actions. For continuous control, $\alpha$ varies more (roughly $0.001$ to $0.1$) and is often auto-tuned.
  • Entropy computation: For categorical policies with $K$ actions, maximum entropy is $\log K$. For Gaussian policies, entropy depends only on the standard deviation $\sigma$ and is unbounded above (see the sketch after this list).
  • Gradient of entropy: For a categorical policy, $\nabla_\theta H(\pi_\theta(\cdot \mid s)) = -\sum_a \nabla_\theta \pi_\theta(a \mid s) \left(\log \pi_\theta(a \mid s) + 1\right)$. Most deep learning frameworks compute this automatically.
  • Entropy decay: Some implementations anneal $\alpha$ from a larger value to a smaller one during training, encouraging more exploration early and more exploitation later.
  • Numerical stability: When $\pi(a \mid s) \to 0$, the $\pi \log \pi$ term can cause numerical issues ($\log 0 = -\infty$). Adding a small constant (e.g., $10^{-8}$) inside the log prevents NaN values.
  • Entropy regularization is not the same as epsilon-greedy: Epsilon-greedy exploration is uniform over random actions. Entropy regularization smoothly distributes probability, favoring near-optimal actions while maintaining stochasticity.
  • In multi-dimensional continuous action spaces, the entropy of a factored (diagonal) Gaussian policy is the sum of the per-dimension entropies, so the bonus applies independently per dimension.
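
A minimal sketch of the entropy computations mentioned above (a categorical entropy with an epsilon guard against $\log 0$, and a factored Gaussian entropy summed over action dimensions); the example values are arbitrary:

```python
import numpy as np

def categorical_entropy(probs, eps=1e-8):
    """Entropy of a categorical policy; eps guards against log(0)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def diagonal_gaussian_entropy(log_std):
    """Entropy of a factored (diagonal) Gaussian policy, summed over
    action dimensions: sum_i 0.5 * log(2 * pi * e * sigma_i^2)."""
    log_std = np.asarray(log_std, dtype=float)
    return np.sum(0.5 * np.log(2.0 * np.pi * np.e) + log_std, axis=-1)

print(categorical_entropy([0.25, 0.25, 0.25, 0.25]), np.log(4))  # uniform: log(K) ~ 1.386
print(categorical_entropy([1.0, 0.0, 0.0, 0.0]))                 # deterministic: ~0, no NaN
print(diagonal_gaussian_entropy(np.zeros(3)))                    # 3-dim unit Gaussian: ~4.257
```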

Common Misconceptions

  • "Entropy regularization makes the agent random." With appropriate , the policy is stochastic but still concentrated on good actions. The entropy bonus prevents collapsing to a single action, not from preferring good actions. A well-tuned entropy-regularized policy is far from uniform.
  • "Entropy regularization is only about exploration." It also improves robustness (stochastic policies handle environmental perturbations better), prevents overfitting to reward model artifacts (critical in RLHF), and leads to better optimization landscapes (smoother objectives).
  • "You always need entropy regularization." In some environments with simple reward landscapes and no local optima, policies converge fine without it. However, the computational cost is negligible, so it is almost always included as a safeguard.
  • "SAC is just actor-critic with entropy." SAC's maximum entropy framework changes the Bellman equations, the policy optimization target, and the value function definitions. It is a fundamentally different framework, not just an add-on.
  • "Higher entropy always means better exploration." A uniform policy has maximum entropy but explores very inefficiently. Effective exploration requires directing probability toward informative actions, not just being random.

Connections to Other Concepts

  • actor-critic-methods.md -- Entropy regularization is a standard component of the actor's loss in all actor-critic methods.
  • proximal-policy-optimization.md -- PPO includes the entropy bonus as the $c_2 S[\pi_\theta](s_t)$ term in its combined loss function. The entropy coefficient interacts with the clipping parameter $\epsilon$.
  • a2c-and-a3c.md -- The entropy bonus term in the A2C/A3C loss function (written with coefficient $\beta$ in the A3C paper) is exactly this entropy regularization.
  • advantage-estimation.md -- The entropy bonus modifies the effective advantage, adding a state-dependent exploration incentive to the standard advantage.
  • trust-region-methods.md -- Trust region methods and entropy regularization both prevent destructive policy changes, but through different mechanisms (step size vs. distribution shape).

Further Reading

  • Williams & Peng (1991), "Function Optimization Using Connectionist Reinforcement Learning Algorithms" -- Early work combining entropy regularization with policy gradient methods, establishing the entropy bonus as a tool for maintaining exploration.
  • Mnih et al. (2016), "Asynchronous Methods for Deep Reinforcement Learning" -- Introduces the entropy regularization term in A3C and empirically demonstrates its importance for preventing premature convergence.
  • Haarnoja et al. (2018), "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" -- The SAC paper that elevates entropy regularization to a first-class principle, introducing automatic temperature tuning and achieving state-of-the-art continuous control performance.
  • Ziebart (2010), PhD Thesis, "Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy" -- The foundational theoretical work on maximum entropy RL, providing the information-theoretic justification for entropy-regularized objectives.
  • Ahmed et al. (2019), "Understanding the Impact of Entropy on Policy Optimization" -- Empirical analysis of how entropy regularization affects policy gradient optimization, demonstrating that its primary benefit is improving the optimization landscape rather than exploration per se.