One-Line Summary: Decoupling action selection from evaluation to correct DQN's systematic overestimation of Q-values.
Prerequisites: q-learning.md, deep-q-networks.md, target-networks.md, bellman-equations.md
What Is Double DQN?
Imagine you are choosing a restaurant. You ask friends for recommendations and always pick the friend whose rating is highest. But some friends are enthusiastic exaggerators -- they rate everything highly. By always picking the maximum rating, you systematically end up at overhyped restaurants. A smarter approach: have one friend suggest the restaurant, and a different friend rate it. This decoupling eliminates the bias from always chasing the most optimistic estimate. Double DQN applies exactly this logic to Q-learning, using one network to select the best action and a separate network to evaluate that action's value.
Standard Q-learning and DQN use a max operator that simultaneously selects the best action and evaluates its value using the same Q-function. When Q-values contain estimation noise (as they inevitably do with function approximation), the max operator preferentially selects overestimated actions, creating a systematic upward bias. Double DQN corrects this with a minimal change to the DQN algorithm.
How It Works
The Maximization Bias Problem
In standard Q-learning, the TD target is:

$$y = r + \gamma \max_{a'} Q(s', a')$$
The max operator does double duty: it both selects the best action ($\arg\max_{a'} Q(s', a')$) and evaluates it (the Q-value at that action). When Q-values contain zero-mean noise $\epsilon_a$, the expected value of the max is biased upward:

$$\mathbb{E}\left[\max_{a}\big(Q(s, a) + \epsilon_a\big)\right] \geq \max_{a}\mathbb{E}\big[Q(s, a) + \epsilon_a\big] = \max_{a} Q(s, a)$$
This follows from Jensen's inequality applied to the convex max function. The bias grows with the number of actions and the variance of the noise. In early training, when Q-estimates are highly noisy, this overestimation can be severe -- Q-values inflate to unrealistic levels, causing the agent to commit to suboptimal actions based on noise rather than signal.
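To make the bias concrete, here is a minimal numpy sketch (assuming ten actions whose true values are all zero and unit-variance Gaussian estimation noise): each individual estimate is unbiased, yet the max of the estimates is systematically above the true maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials, noise_std = 10, 100_000, 1.0

# True Q-values: all zero, so max_a E[Q_hat(a)] = 0.
true_q = np.zeros(n_actions)

# Each estimate is true value + zero-mean noise, hence unbiased,
# but the max over noisy estimates is biased upward.
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, n_actions))
print("E[max_a Q_hat(a)] ~", noisy_q.max(axis=1).mean())  # roughly 1.5 for 10 actions
print("max_a E[Q_hat(a)] =", true_q.max())                # 0.0
```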
The Double Q-Learning Idea
van Hasselt (2010) proposed the core insight: use two independent Q-functions, $Q^A$ and $Q^B$. One selects the action, the other evaluates it:

$$y = r + \gamma\, Q^B\big(s', \arg\max_{a'} Q^A(s', a')\big)$$
Because the selection and evaluation use different estimators with independent noise, the overestimation bias is eliminated in expectation. Even if $Q^A$ selects a noisy action, $Q^B$ provides an unbiased evaluation (and vice versa). The original tabular Double Q-learning maintained two separate Q-tables, updated alternately.
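A rough sketch of the tabular update (assuming Q_A and Q_B are numpy arrays indexed by state and action, and ignoring terminal-state handling): each step randomly picks which table to update, with the other table providing the evaluation.

```python
import numpy as np

def double_q_update(Q_A, Q_B, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    """One tabular Double Q-learning step (sketch; terminal states not handled)."""
    if rng.random() < 0.5:
        a_star = np.argmax(Q_A[s_next])           # Q_A selects the greedy action
        target = r + gamma * Q_B[s_next, a_star]  # Q_B evaluates it
        Q_A[s, a] += alpha * (target - Q_A[s, a])
    else:
        a_star = np.argmax(Q_B[s_next])           # Q_B selects
        target = r + gamma * Q_A[s_next, a_star]  # Q_A evaluates
        Q_B[s, a] += alpha * (target - Q_B[s, a])
```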
Double DQN Implementation
van Hasselt, Guez, and Silver (2016) realized that DQN already has two networks -- the online network $Q(\cdot\,; \theta)$ and the target network $Q(\cdot\,; \theta^-)$ -- and the fix requires changing only a single line. Instead of:

$$y^{\text{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$
Double DQN computes:

$$y^{\text{DoubleDQN}} = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\big)$$
The key difference: the online network $\theta$ selects the action ($\arg\max_{a'} Q(s', a'; \theta)$), but the target network $\theta^-$ evaluates that action's value. This decouples selection from evaluation using the two networks that DQN already maintains, with negligible computational overhead.
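In code, the change is confined to the target computation. A minimal PyTorch-style sketch, assuming online_net and target_net map a batch of states to per-action Q-values and dones is a 0/1 float tensor:

```python
import torch

def dqn_target(rewards, next_states, dones, target_net, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action.
    with torch.no_grad():
        next_q = target_net(next_states)            # (batch, n_actions)
        max_next_q = next_q.max(dim=1).values       # max over actions
        return rewards + gamma * (1.0 - dones) * max_next_q

def double_dqn_target(rewards, next_states, dones, online_net, target_net, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it.
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        eval_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones) * eval_q
```

The only difference between the two functions is which network computes the argmax; the rest of the training loop is unchanged.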
Recommended visual: Side-by-side comparison of DQN vs. Double DQN target computation, highlighting which network performs selection vs. evaluation. See Figure 1 in van Hasselt et al. (2016).
A Simple Example
Consider a state with three actions whose true Q-values are all 1.0. Suppose that, because of noise, the online network estimates Q = (1.3, 1.1, 0.8) and the target network estimates Q = (0.9, 1.2, 1.0). Ignoring the reward and discount for clarity:
- DQN target: Uses the target network for both selection and evaluation. max(0.9, 1.2, 1.0) = 1.2 (action 2). Overestimates by 0.2.
- Double DQN target: Online network selects action 1 (argmax of (1.3, 1.1, 0.8)). Target network evaluates action 1: 0.9. Underestimates by 0.1.
Neither is perfect on a single sample, but over many samples Double DQN's errors average to approximately zero, while DQN's errors are systematically positive.
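A two-line check of the arithmetic above, using the same illustrative (assumed) numbers:

```python
import numpy as np

true_q = np.array([1.0, 1.0, 1.0])      # true values, all equal
online_q = np.array([1.3, 1.1, 0.8])    # noisy online-network estimates
target_q = np.array([0.9, 1.2, 1.0])    # noisy target-network estimates

dqn_value = target_q.max()                    # target net selects and evaluates
ddqn_value = target_q[online_q.argmax()]      # online selects, target evaluates
print(dqn_value - true_q.max(), ddqn_value - true_q.max())   # ~0.2, ~-0.1
```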
Why It Matters
Overestimation bias is not merely a theoretical concern. In practice, DQN's overestimation leads to:
- Suboptimal policies: The agent commits to actions that appear valuable due to noise, ignoring truly optimal actions with more accurately estimated (lower) values.
- Training instability: Inflated Q-values propagate through bootstrapping, creating a positive feedback loop where overestimation compounds across states.
- Unreliable value estimates: When Q-values are inflated, they lose their meaning as predictions of actual returns, making debugging and analysis difficult.
Double DQN was shown to reduce overestimation across all 49 Atari games tested, with median performance improving and many games showing substantial gains -- for essentially zero additional implementation or computational cost.
Key Technical Details
- Implementation cost: Changing one line of code -- the target computation. No new networks, no new hyperparameters, no additional memory.
- Atari results: Double DQN improves median human-normalized score from ~93% (DQN) to ~117% across 49 games. Several games improve dramatically (e.g., Asterix: 6,012 to 28,188).
- Overestimation reduction: On some games, DQN's Q-value estimates exceed true values by 2--10x. Double DQN reduces this gap to near zero.
- Compatible with other improvements: Double DQN is orthogonal to experience replay variants, dueling architectures, and other DQN extensions. It is included as a standard component in Rainbow (rainbow-dqn.md).
- Can introduce underestimation: Double DQN can slightly underestimate Q-values in some cases, but moderate underestimation is far less harmful than overestimation because it does not create runaway positive feedback.
- The two networks are not independent: Unlike the original Double Q-learning proposal, the target network $\theta^-$ is a delayed copy of the online network $\theta$, so they are correlated. Empirically, this correlation is weak enough that the bias reduction is substantial.
Common Misconceptions
- "Double DQN requires training two separate networks." It reuses the two networks DQN already maintains (online and target). There is no additional training, no additional forward passes during action selection, and no additional memory.
- "Double DQN completely eliminates overestimation." It substantially reduces it but does not eliminate it entirely, because the online and target networks are correlated (the target is a delayed copy of the online network). However, the residual bias is small enough to be practically insignificant.
- "Overestimation is always bad." In principle, consistent overestimation that preserves the ranking of actions would not affect policy quality. The problem is that overestimation is uneven -- some actions get inflated more than others, distorting the ranking and degrading the policy.
- "Double DQN is an older, superseded technique." Double DQN remains a standard component of modern value-based methods. It is included in Rainbow, and the decoupling principle has been extended to continuous-action algorithms like TD3 (which uses clipped double Q-learning).
Connections to Other Concepts
- deep-q-networks.md: DQN is the base algorithm that Double DQN improves with a one-line change.
- target-networks.md: Target networks provide the second network that enables action evaluation to be decoupled from selection.
- q-learning.md: Q-learning inherits maximization bias from the max operator even in the tabular case, though the effect is less severe without function approximation.
- dueling-dqn.md: Dueling DQN changes the network architecture; Double DQN changes the target computation. The two are orthogonal and combinable.
- rainbow-dqn.md: Rainbow includes Double DQN as one of its six combined improvements.
Further Reading
- "Double Q-learning" (van Hasselt, 2010) -- The original tabular double Q-learning paper, proposing the decoupling of selection and evaluation with two independent Q-tables updated alternately.
- "Deep Reinforcement Learning with Double Q-learning" (van Hasselt, Guez & Silver, 2016) -- Adapts double Q-learning to DQN by reusing the target network, demonstrating large overestimation reductions and performance improvements across Atari.
- "Issues in Using Function Approximation for Reinforcement Learning" (Thrun & Schwartz, 1993) -- Early identification of systematic overestimation in Q-learning with function approximation.
- "Addressing Function Approximation Error in Actor-Critic Methods" (Fujimoto, van Hoof & Precup, 2018) -- Extends the double Q-learning idea to actor-critic methods via TD3's clipped double Q-learning for continuous action spaces.