One-Line Summary: Bridging Monte Carlo and TD by bootstrapping after n steps -- a tunable bias-variance trade-off.

Prerequisites: temporal-difference-learning.md, monte-carlo-methods.md, return-and-discount-factor.md, value-functions.md

What Are N-Step Methods?

Imagine predicting the weather. A one-step forecast ("what will tomorrow be like based on today?") is low variance but relies heavily on your current model's accuracy. A full-season forecast ("average of all days this season") uses real data but is noisy. An n-day forecast -- say, looking ahead a week before trusting your model -- balances these extremes. N-step methods in RL work identically: instead of bootstrapping after one step (TD) or waiting until the end of the episode (MC), they wait n steps, using real observed rewards for those n steps and then bootstrapping from the estimated value at step t + n.

N-step methods provide a tunable bridge between TD(0) at one extreme (n = 1) and Monte Carlo at the other (n spanning the remaining episode). The parameter n controls the bias-variance trade-off: small n gives low variance but higher bias (from bootstrapping off potentially inaccurate estimates), while large n gives lower bias but higher variance (from accumulating many stochastic rewards).

How It Works

The N-Step Return

The n-step return from time t is defined as:

    G_{t:t+n} = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{n-1} R_{t+n} + γ^n V(S_{t+n})

More compactly:

    G_{t:t+n} = Σ_{k=1}^{n} γ^{k-1} R_{t+k} + γ^n V(S_{t+n})

Special cases reveal the unifying nature:

  • n = 1: G_{t:t+1} = R_{t+1} + γ V(S_{t+1}) -- the TD(0) target.
  • n ≥ T - t (the remaining episode length): G_{t:t+n} = G_t = R_{t+1} + γ R_{t+2} + ... + γ^{T-t-1} R_T -- the full MC return (with V(S_T) = 0 at the terminal state).
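
To make the definition concrete, here is a minimal Python sketch that computes G_{t:t+n} from stored rewards and a current value table, including the truncation used near episode boundaries (the function name n_step_return and its flat-list inputs are illustrative, not from the text):

    def n_step_return(rewards, bootstrap_value, gamma, n, t, T):
        """Compute the n-step return G_{t:t+n} (a sketch).

        rewards[k] holds R_{k+1}, the reward observed on the step taken at time k;
        bootstrap_value is the current estimate V(S_{t+n});
        T is the time at which the episode terminated (or a very large number).
        """
        horizon = min(t + n, T)
        # Discounted sum of the real rewards R_{t+1}, ..., R_{horizon}
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # Bootstrap from the value estimate only if the episode has not ended by t + n
        if t + n < T:
            g += gamma ** n * bootstrap_value
        return g

For example, with gamma = 0.9, n = 2, rewards of 1 and 2 observed after time t, and V(S_{t+2}) = 10, the target is 1 + 0.9 * 2 + 0.81 * 10 = 10.9.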

N-Step TD Prediction

The value update uses the n-step return as the target:

    V(S_t) ← V(S_t) + α [G_{t:t+n} - V(S_t)]

This update can only be made at time t + n (once R_{t+n} and S_{t+n} are known), introducing a delay of n steps between experiencing a transition and updating the corresponding value.

N-Step TD Prediction:
Initialize V(s) arbitrarily; choose step size alpha, discount gamma, and n
For each episode:
    Observe and store the initial state S_0; T <- infinity
    For t = 0, 1, 2, ... (until tau = T - 1):
        If t < T:
            Take an action according to the policy, observe and store R_{t+1}, S_{t+1}
            If S_{t+1} is terminal: T <- t + 1
        tau <- t - n + 1  (time of the state whose value is updated)
        If tau >= 0:
            G <- sum_{k=tau+1}^{min(tau+n, T)} gamma^{k-tau-1} * R_k
            If tau + n < T: G <- G + gamma^n * V(S_{tau+n})
            V(S_tau) <- V(S_tau) + alpha * [G - V(S_tau)]
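
The listing above translates almost line for line into code. Below is a minimal Python sketch of online n-step TD prediction; the environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the policy function are assumptions made for illustration, not part of the source.

    from collections import defaultdict

    def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes):
        """Online n-step TD policy evaluation (a sketch; env/policy interface assumed)."""
        V = defaultdict(float)                       # V(s), initialized to 0 for all states
        for _ in range(num_episodes):
            states = [env.reset()]                   # states[t] holds S_t
            rewards = [0.0]                          # rewards[k] holds R_k (index 0 unused)
            T = float("inf")
            t = 0
            while True:
                if t < T:
                    next_state, reward, done = env.step(policy(states[t]))
                    states.append(next_state)        # S_{t+1}
                    rewards.append(reward)           # R_{t+1}
                    if done:
                        T = t + 1
                tau = t - n + 1                      # time of the state being updated
                if tau >= 0:
                    horizon = min(tau + n, T)        # truncate the return at episode end
                    G = sum(gamma ** (k - tau - 1) * rewards[k]
                            for k in range(tau + 1, horizon + 1))
                    if tau + n < T:                  # bootstrap only if the episode is still running
                        G += gamma ** n * V[states[tau + n]]
                    V[states[tau]] += alpha * (G - V[states[tau]])
                if tau == T - 1:                     # every state of the episode has been updated
                    break
                t += 1
        return V

Note that the first update happens only at t = n - 1, and the final n - 1 updates are made after the episode terminates: the loop keeps running with t >= T until tau reaches T - 1, exactly as in the pseudocode.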

Recommended visual: Performance of n-step TD on the 19-state random walk as a function of n and α. Source: Sutton & Barto, Reinforcement Learning: An Introduction, Figure 7.2.

N-Step SARSA

Extending to control, n-step SARSA uses action-value n-step returns:

    G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

The update becomes:

    Q(S_t, A_t) ← Q(S_t, A_t) + α [G_{t:t+n} - Q(S_t, A_t)]

This naturally extends to n-step Expected SARSA by replacing Q(S_{t+n}, A_{t+n}) with the expected value Σ_a π(a | S_{t+n}) Q(S_{t+n}, a).
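
As a sketch of how this target could be assembled from a stored trajectory segment (argument names and layout are illustrative assumptions, not from the source):

    def n_step_sarsa_target(rewards, gamma, n, bootstrap_q=None):
        """n-step SARSA target for the state-action pair at the start of the segment.

        rewards      -- the observed rewards R_{t+1}, ..., up to n of them
        bootstrap_q  -- Q(S_{t+n}, A_{t+n}); pass None if the episode ended within n steps
        """
        target = sum(gamma ** i * r for i, r in enumerate(rewards))
        if bootstrap_q is not None:
            target += gamma ** n * bootstrap_q
        return target

    # Tabular update, mirroring one-step SARSA but with the n-step target:
    # Q[s_t, a_t] += alpha * (n_step_sarsa_target(...) - Q[s_t, a_t])

For n-step Expected SARSA, bootstrap_q would instead be the policy-weighted average Σ_a π(a | S_{t+n}) Q(S_{t+n}, a).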

N-Step Off-Policy Learning

For off-policy n-step methods, an importance sampling ratio corrects for the mismatch between the target policy π and the behavior policy b:

    ρ_{t:t+n-1} = Π_{k=t}^{min(t+n-1, T-1)} π(A_k | S_k) / b(A_k | S_k)

The update becomes:

    V(S_t) ← V(S_t) + α ρ_{t:t+n-1} [G_{t:t+n} - V(S_t)]

The product of importance sampling ratios can have high variance for large n, which is a practical limitation of off-policy n-step methods.
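
A minimal sketch of this correction, assuming the target and behavior policies are available as functions returning action probabilities (the names below are illustrative):

    def importance_ratio(states, actions, target_prob, behavior_prob):
        """Product of per-step importance sampling ratios rho_{t:t+n-1} (a sketch).

        states/actions hold S_t..S_{t+n-1} and A_t..A_{t+n-1}, truncated at T - 1
        near the end of the episode; *_prob(s, a) returns pi(a|s) or b(a|s).
        """
        rho = 1.0
        for s, a in zip(states, actions):
            rho *= target_prob(s, a) / behavior_prob(s, a)
        return rho

    # Off-policy n-step TD update (tabular):
    # V[s_t] += alpha * rho * (G - V[s_t])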

The Bias-Variance Trade-Off

The choice of n controls a fundamental trade-off:

| Property | Small n (TD-like) | Large n (MC-like) |
| --- | --- | --- |
| Bias | Higher (bootstraps from estimates) | Lower (uses more real rewards) |
| Variance | Lower (fewer random variables) | Higher (accumulates many random rewards) |
| Speed of propagation | Slow (one step at a time) | Fast (information travels n steps per update) |
| Sensitivity to initial values | Higher (bootstraps from them) | Lower (actual returns dominate) |

The optimal n is problem-dependent. The classic result from Sutton and Barto's 19-state random walk shows that intermediate values of n (around 4-8) consistently outperform both extremes across a range of learning rates, demonstrating that the best trade-off is rarely at either end of the spectrum.
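
A compact sketch of that kind of experiment, assuming the standard 19-state random walk (start in the middle, rewards of -1 and +1 at the two ends, gamma = 1, true values known in closed form). All names below are illustrative and the measurement protocol is simplified relative to the book's figure:

    import random

    TRUE_V = [s / 10.0 - 1.0 for s in range(1, 20)]     # true values of states 1..19

    def random_walk_episode():
        """One episode of the 19-state random walk: returns (states, rewards)."""
        s, states, rewards = 10, [10], [0.0]
        while 0 < s < 20:
            s += random.choice((-1, 1))
            states.append(s)
            rewards.append(1.0 if s == 20 else (-1.0 if s == 0 else 0.0))
        return states, rewards

    def n_step_td_episode(V, states, rewards, n, alpha, gamma=1.0):
        """Apply n-step TD updates for one already-generated episode."""
        T = len(states) - 1
        for tau in range(T):
            horizon = min(tau + n, T)
            G = sum(gamma ** (k - tau - 1) * rewards[k] for k in range(tau + 1, horizon + 1))
            if tau + n < T:
                G += gamma ** n * V[states[tau + n]]
            V[states[tau]] += alpha * (G - V[states[tau]])

    def rms_error(V):
        return (sum((V[s] - TRUE_V[s - 1]) ** 2 for s in range(1, 20)) / 19) ** 0.5

    # Sweep n at a fixed learning rate; intermediate n typically gives the lowest error.
    for n in (1, 2, 4, 8, 16, 64):
        errors = []
        for _ in range(100):                             # independent runs
            V = [0.0] * 21                               # states 0..20; terminal values stay 0
            for _ in range(10):                          # 10 episodes per run
                n_step_td_episode(V, *random_walk_episode(), n=n, alpha=0.2)
            errors.append(rms_error(V))
        print(n, round(sum(errors) / len(errors), 3))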

Truncated N-Step Returns

Near episode boundaries, when fewer than n steps remain, the return is naturally truncated:

    G_{t:t+n} = G_t = R_{t+1} + γ R_{t+2} + ... + γ^{T-t-1} R_T    if t + n ≥ T

This graceful degradation means n-step methods automatically become MC-like near episode termination, using however many real steps remain.
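
For instance, with n = 4 and an episode that terminates two steps after time t (T = t + 2), the bootstrap term disappears and the target reduces to the Monte Carlo return for that state:

    G_{t:t+4} = R_{t+1} + γ R_{t+2} = G_t        (since t + 4 ≥ T)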

Why It Matters

N-step methods formalize the intuition that neither pure TD nor pure MC is universally optimal. They give practitioners a tunable knob, n, to adapt to the specific characteristics of their problem: use smaller n for problems with accurate value estimates and high reward variance, and larger n for problems with poor initial estimates or strong sequential dependencies. N-step returns also serve as the conceptual stepping stone to eligibility traces, which elegantly average over all values of n simultaneously.

Key Technical Details

  • Optimal n: Problem-dependent, but intermediate values (roughly 4-8 in tabular benchmarks) frequently outperform both TD(0) and MC.
  • Memory requirement: Must store the last n states, actions, and rewards, requiring O(n) additional memory (see the buffer sketch after this list).
  • Update delay: The value of S_t cannot be updated until time t + n, delaying credit assignment. This means the first n - 1 time steps of each episode produce no updates.
  • Computational cost: Each update costs O(n) to compute the n-step return, compared to O(1) for TD(0).
  • Off-policy instability: The product of importance sampling ratios grows exponentially in variance with n, making off-policy n-step methods impractical for large n.
  • Convergence: n-step TD converges to v_π under the same conditions as TD(0), with the rate depending on n, α, and the environment structure.
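
To illustrate the memory requirement, here is a minimal sketch of an n-step transition buffer of the kind used in deep RL implementations (the class and method names are illustrative assumptions): it keeps only the last n transitions and, once full, emits the oldest state-action pair together with its n-step discounted reward sum and the bootstrap state.

    from collections import deque

    class NStepBuffer:
        """Holds the last n transitions and emits n-step targets (a sketch)."""

        def __init__(self, n, gamma):
            self.n, self.gamma = n, gamma
            self.transitions = deque(maxlen=n)       # O(n) memory: (state, action, reward)

        def append(self, state, action, reward, next_state, done):
            self.transitions.append((state, action, reward))
            if len(self.transitions) == self.n or done:
                # Discounted sum of the stored real rewards
                g = sum(self.gamma ** i * r for i, (_, _, r) in enumerate(self.transitions))
                s0, a0, _ = self.transitions[0]
                # Caller bootstraps from next_state only when done is False; a complete
                # implementation would also flush shorter returns for the remaining
                # buffered states when the episode ends.
                return s0, a0, g, next_state, done
            return None                              # not enough transitions yet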

Common Misconceptions

  • "Larger is always better because it reduces bias." While larger reduces bias from bootstrapping, it increases variance from reward stochasticity. The optimal balances these effects and is almost never (full MC).
  • "N-step methods are just a theoretical construct." N-step returns are directly used in practical algorithms. A3C uses -step returns (typically or ) as its primary update target. Many deep RL implementations use -step returns for improved sample efficiency.
  • "You must choose a single in advance." Eligibility traces (TD()) and compound returns allow effectively averaging over multiple values of , but even fixed can be tuned as a hyperparameter.

Connections to Other Concepts

  • temporal-difference-learning.md -- TD(0) is the special case n = 1. N-step methods generalize TD by deferring bootstrapping.
  • monte-carlo-methods.md -- MC is the special case n → ∞ (or n ≥ T - t). N-step methods recover MC when n reaches the remaining episode length.
  • eligibility-traces.md -- TD(λ) can be viewed as a geometric average over all n-step returns, providing a more elegant way to trade off bias and variance.
  • sarsa.md -- N-step SARSA extends SARSA by using n-step returns, often substantially improving performance.
  • q-learning.md -- N-step Q-learning (e.g., the tree-backup algorithm) extends Q-learning to multi-step off-policy updates.
  • a2c-and-a3c.md -- A3C uses n-step returns as its default update target in deep RL, typically with a small n such as 5.

Further Reading

  1. "Reinforcement Learning: An Introduction," Chapter 7 (Sutton & Barto, 2018) -- The definitive treatment of -step methods, including prediction, control, off-policy corrections, and the random walk experiments.
  2. "Multi-Step Reinforcement Learning: A Unifying Algorithm" (De Asis et al., 2018) -- Proposes a unified -step algorithm encompassing Q-learning, SARSA, Expected SARSA, and Tree Backup as special cases.
  3. "Asynchronous Methods for Deep Reinforcement Learning" (Mnih et al., 2016) -- The A3C paper that popularized -step returns in deep RL, demonstrating their effectiveness with neural network function approximation.
  4. "Safe and Efficient Off-Policy Reinforcement Learning" (Munos et al., 2016) -- Introduces Retrace(), addressing the high variance of importance sampling in multi-step off-policy learning.