One-Line Summary: Credit assignment mechanism that blends TD and Monte Carlo through exponentially decaying memory of visited states.

Prerequisites: temporal-difference-learning.md, monte-carlo-methods.md, return-and-discount-factor.md.

What Are Eligibility Traces?

Imagine you're training a dog. The dog performs a sequence of tricks -- sit, shake, roll over -- and you give it a treat at the end. Which trick earned the treat? The most recent one (roll over) deserves the most credit, but the earlier ones contributed too. You might assign decreasing credit backward through time: roll over gets the most, shake gets some, and sit gets a little.

Eligibility traces implement exactly this idea in RL. They maintain a decaying memory of which states were recently visited, so when a reward signal arrives (via the TD error), credit is distributed backward to all recently visited states -- not just the immediately preceding one. This bridges the gap between TD learning (which updates only one step back) and Monte Carlo methods (which wait until the episode ends).

How It Works

The Forward View: The $\lambda$-Return

The $\lambda$-return $G_t^\lambda$ is a weighted average of all $n$-step returns:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $G_t^{(n)}$ is the $n$-step return.

The geometric weighting $(1 - \lambda)\lambda^{n-1}$ sums to 1, creating a valid average. This smoothly interpolates:

  • $\lambda = 0$: Only the 1-step return -- equivalent to TD(0)
  • $\lambda = 1$: The full Monte Carlo return (no bootstrapping)
  • $0 < \lambda < 1$: A blend, with exponentially decreasing weight on longer returns
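
To make the weighting concrete, here is a minimal Python sketch (the rewards and value estimates are made-up placeholders) that computes $G_t^\lambda$ for an episodic trajectory, with the leftover weight going to the full Monte Carlo return:

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n): up to n discounted rewards, then bootstrap from V."""
    T = len(rewards)
    steps = min(n, T - t)                    # truncate at episode end
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:                            # bootstrap only if the episode continues
        g += gamma**n * values[t + n]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda = (1 - lam) * sum_n lam^(n-1) * G_t^(n); in the episodic case
    the remaining weight lam^(T-t-1) goes to the full Monte Carlo return."""
    T = len(rewards)
    g_lam = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
                for n in range(1, T - t))
    g_lam += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g_lam

# Toy episode: three steps, placeholder rewards and value estimates.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]                # V(S_0)..V(S_3); S_3 is terminal
print(lambda_return(rewards, values, t=0, gamma=0.99, lam=0.9))
```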

The Backward View: Eligibility Traces

Computing the $\lambda$-return in the forward view requires waiting until the end of the episode. The backward view achieves the same result incrementally using eligibility traces.

Each state $s$ has an eligibility trace $e_t(s)$ that tracks how recently and frequently state $s$ was visited:

Accumulating traces:

$$e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}(S_t = s)$$

Replacing traces:

$$e_t(s) = \begin{cases} 1 & \text{if } S_t = s \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases}$$

The trace decays by a factor of $\gamma\lambda$ at each step and gets a boost of $+1$ when the state is visited. Replacing traces cap the trace at 1 upon revisiting, preventing inflated updates in loops.
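
Both update rules are one line apiece. Here is a small NumPy sketch (the state indices and the toy visit sequence are arbitrary) showing how accumulating and replacing traces differ when a state is revisited:

```python
import numpy as np

def update_traces(e, s, gamma, lam, kind="accumulating"):
    """Decay every trace by gamma*lambda, then boost the just-visited state s."""
    e = gamma * lam * e                  # exponential decay for all states
    if kind == "accumulating":
        e[s] += 1.0                      # can exceed 1 if s is revisited quickly
    else:                                # replacing
        e[s] = 1.0                       # capped at 1 on every visit
    return e

e = np.zeros(5)
for s in [2, 2, 3]:                      # revisit state 2 to see the difference
    e = update_traces(e, s, gamma=0.99, lam=0.9)
print(e)                                 # accumulating: e[2] > 1; replacing would cap it
```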

TD($\lambda$) Algorithm

At each time step:

  1. Observe transition $(S_t, A_t, R_{t+1}, S_{t+1})$
  2. Compute TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
  3. Update traces: $e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}(S_t = s)$ for all states $s$
  4. Update all state values: $V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s)$ for all $s$

The key insight: the same TD error is used to update every state, but weighted by its eligibility trace. States visited long ago have decayed traces and receive small updates. The state just visited has a trace near 1 and receives the full update.
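
Putting the four steps into one loop, here is a minimal tabular TD($\lambda$) sketch with accumulating traces. The `env` object (with `reset()` returning an integer state and `step(action)` returning `(next_state, reward, done)`) and the `policy` function are hypothetical placeholders, not part of any particular library:

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of tabular TD(lambda). V is a NumPy array of state values."""
    e = np.zeros_like(V)                     # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)        # 1. observe the transition
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                # 2. TD error
        e *= gamma * lam                     # 3. decay every trace...
        e[s] += 1.0                          #    ...and boost the just-visited state
        V += alpha * delta * e               # 4. update every state, weighted by its trace
        s = s_next
    return V
```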

SARSA($\lambda$) and Q($\lambda$)

Eligibility traces extend naturally to control:

SARSA($\lambda$): Maintain traces over state-action pairs $(s, a)$:

$$e_t(s, a) = \gamma\lambda\, e_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a), \qquad Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, e_t(s, a) \ \text{for all } (s, a)$$

with the on-policy TD error $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$.

Watkins's Q($\lambda$): Combines Q-learning with traces, but cuts the trace to zero whenever a non-greedy action is taken (because Q-learning is off-policy, and the trace should only credit the greedy path):

$$e_t(s, a) = \mathbf{1}(S_t = s, A_t = a) + \begin{cases} \gamma\lambda\, e_{t-1}(s, a) & \text{if } A_t \text{ is greedy w.r.t. } Q(S_t, \cdot) \\ 0 & \text{otherwise} \end{cases}$$
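
To make the trace-cutting concrete, here is a sketch of a single Watkins's Q($\lambda$) update for tabular `Q` and trace arrays `E` of shape |S| x |A|. The function name, argument layout, and the assumption that a behavior policy supplies `a_next` are illustrative choices, not a fixed API:

```python
import numpy as np

def watkins_q_lambda_step(Q, E, s, a, r, s_next, a_next, done,
                          alpha=0.1, gamma=0.99, lam=0.9):
    """One Watkins's Q(lambda) update; a_next is the action the behavior
    policy will actually take in s_next (irrelevant if done)."""
    a_star = np.argmax(Q[s_next])                    # greedy action in the next state
    target = r if done else r + gamma * Q[s_next, a_star]
    delta = target - Q[s, a]                         # off-policy (Q-learning) TD error
    E[s, a] += 1.0                                   # accumulate the trace for the taken pair
    Q += alpha * delta * E                           # credit all recently visited pairs
    if done:
        E[:] = 0.0                                   # episode over: clear traces
    elif Q[s_next, a_next] == Q[s_next].max():       # next action is greedy: keep tracing
        E *= gamma * lam
    else:
        E[:] = 0.0                                   # exploratory action: cut traces to zero
    return Q, E
```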

The Equivalence

Sutton & Barto (2018) prove that the forward view ($\lambda$-return) and backward view (eligibility traces) produce identical total updates over an episode (for the offline/batch case). The backward view is computationally preferable because it updates incrementally at each step.

Why It Matters

Eligibility traces provide a unifying framework for TD and Monte Carlo methods. Rather than choosing between TD(0) (low variance, high bias) and Monte Carlo (high variance, zero bias), the practitioner can tune $\lambda$ to find the optimal bias-variance trade-off for their problem.

In practice, intermediate values of $\lambda$ (0.8-0.95) often outperform both extremes. Traces propagate reward information faster than TD(0) without waiting for full episodes like Monte Carlo, making learning significantly more sample-efficient in many environments.

Key Technical Details

  • Computational cost: TD($\lambda$) requires storing and updating a trace for every state (or state-action pair) at each step, making it $O(|\mathcal{S}|)$ per step instead of $O(1)$ for TD(0). This motivated truncated traces and sparse implementations.
  • Common values: $\lambda$ in the range 0.8-0.95 typically works best; $\lambda = 0.9$ is a common default.
  • Replacing vs accumulating traces: Replacing traces often work better in practice, especially in environments with loops or repeated state visits. Singh & Sutton (1996) showed replacing traces can be significantly faster.
  • In deep RL, eligibility traces are less commonly used because experience replay (which randomly samples transitions) breaks the temporal structure that traces rely on. GAE (Generalized Advantage Estimation) is the modern equivalent, computing $\lambda$-weighted advantages for policy gradient methods.
  • GAE connection: Generalized Advantage Estimation, as used in PPO and A2C, is mathematically equivalent to using a $\lambda$-return for advantage estimation: $\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error (see the sketch below).
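
A minimal sketch of that computation, done as the usual backward recursion over one trajectory of TD errors (placeholder arrays, single episode, no termination masking):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda): A_t = sum_l (gamma*lam)^l * delta_{t+l}.
    `values` has one extra entry for the bootstrap value after the last step."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):                              # backward recursion
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error at step t
        gae = delta + gamma * lam * gae                       # exponentially weighted sum of deltas
        advantages[t] = gae
    return advantages

# Toy trajectory with placeholder rewards and value estimates.
print(compute_gae(np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.6, 0.8, 0.0])))
```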

Common Misconceptions

  • "Eligibility traces are just a historical curiosity." While less prominent in modern deep RL, the -return idea lives on as GAE, which is central to PPO and modern policy gradient methods. Understanding traces is essential for understanding GAE.
  • " is always equivalent to Monte Carlo." This is true only for episodic tasks. For continuing tasks, can cause issues because traces never fully decay.
  • "Traces require storing all visited states." In practice, traces below a threshold can be zeroed out (sparse traces), and only recently visited states need active traces.
  • "The backward view is an approximation." It produces identical total updates to the forward view over a complete episode (Sutton & Barto, 2018, Chapter 12).

Connections to Other Concepts

  • temporal-difference-learning.md -- TD(0) is the special case $\lambda = 0$.
  • monte-carlo-methods.md -- Monte Carlo is the special case $\lambda = 1$.
  • n-step-methods.md -- $n$-step returns are the individual components of the $\lambda$-return mixture.
  • advantage-estimation.md -- GAE is the modern descendant of eligibility traces for policy gradient methods.
  • q-learning.md -- Q($\lambda$) extends Q-learning with traces but requires trace-cutting for off-policy correctness.

Further Reading

  1. Sutton & Barto (2018) -- Reinforcement Learning: An Introduction, Chapter 12. Definitive treatment of eligibility traces, forward/backward equivalence, and practical variants.
  2. Singh & Sutton (1996) -- "Reinforcement learning with replacing eligibility traces." Machine Learning. Demonstrates the advantages of replacing over accumulating traces.
  3. Schulman et al. (2016) -- "High-dimensional continuous control using generalized advantage estimation." ICLR. The modern application of $\lambda$-weighted returns as GAE for policy gradients.
  4. van Seijen et al. (2016) -- "True Online TD($\lambda$)." JMLR. An improved online implementation that exactly matches the forward-view $\lambda$-return update at every step, not just in aggregate.