One-Line Summary: A foundational framework that interleaves real environment experience with simulated experience generated by a learned model, unifying learning, planning, and acting in a single loop.
Prerequisites: model-based-vs-model-free.md, ../01-foundations/value-functions.md, ../02-tabular-methods/q-learning.md
What Is the Dyna Architecture?
Imagine a chess novice who plays one game per day against a real opponent but then spends hours replaying positions in their mind, imagining moves and outcomes. Each real game teaches them about the world; each mental simulation extracts additional lessons from that knowledge. They learn far faster than someone who only learns during actual games.
Dyna, introduced by Richard Sutton in 1991, formalizes this exact idea. The agent maintains three interacting components: (1) a model of the environment learned from real experience, (2) a value function (or policy) updated using both real and simulated experience, and (3) a planning process that generates simulated transitions from the model to perform additional value-function updates. The key insight is that real experience serves double duty -- it directly improves the value function and improves the model, which in turn generates more training data.
How It Works
The Dyna-Q Algorithm
Dyna-Q extends tabular Q-learning with model learning and planning. At each real timestep:
Step 1 -- Act: Select action $a$ in state $s$ using an $\epsilon$-greedy policy derived from $Q$.
Step 2 -- Learn from reality: Observe reward $r$ and next state $s'$. Perform a direct RL update: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
Step 3 -- Update the model: Store the transition in the model: $\text{Model}(s, a) \leftarrow (r, s')$. In tabular Dyna-Q, this is a deterministic lookup table mapping each experienced $(s, a)$ pair to the most recently observed $(r, s')$.
Step 4 -- Plan (repeat $n$ times): Sample a previously visited state-action pair $(s, a)$ at random. Query the model to get $(r, s')$. Perform a simulated Q-learning update with the same rule as Step 2: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
The parameter $n$ controls the ratio of planning steps to real steps. With $n = 0$, Dyna-Q reduces to standard Q-learning. With $n > 0$, the agent extracts additional value from each real interaction.
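The sketch below puts the four steps together for the tabular case. It is a minimal illustration rather than a canonical implementation: the gym-style `env.reset()` / `env.step()` interface, hashable discrete states, and the hyperparameter defaults are all assumptions made for the example.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_planning=10, alpha=0.1, gamma=0.95,
           epsilon=0.1, episodes=50):
    """Tabular Dyna-Q sketch (assumes hashable states and a gym-style env)."""
    Q = defaultdict(float)      # Q[(s, a)] -> estimated action value
    model = {}                  # model[(s, a)] -> (r, s', done), last observed outcome
    seen = []                   # visited (s, a) pairs, for uniform planning sampling

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Step 1 -- act: epsilon-greedy action selection from Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done, _ = env.step(a)

            # Step 2 -- learn from reality: direct Q-learning update
            target = r + gamma * max(Q[(s_next, b)] for b in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # Step 3 -- update the deterministic lookup-table model
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (r, s_next, done)

            # Step 4 -- plan: n simulated updates from model-generated experience
            for _ in range(n_planning):
                ps, pa = random.choice(seen)
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions) * (not pdone)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s_next
    return Q
```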
The Planning-Learning Spectrum
Dyna reveals that planning and learning are not fundamentally different operations. Both perform the same kind of update to the value function: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
The only difference is the source of the target: real experience (learning) or model-generated experience (planning). This unification is one of Dyna's deepest contributions. By adjusting $n$, we slide along a spectrum from pure model-free learning ($n = 0$) to heavily model-based planning (large $n$).
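In code, the unification is literal: a single update function can be fed either a real transition or a model-generated one. A tiny self-contained illustration (the integer states and the reward value are made up for the example):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """The one update rule shared by learning (real data) and planning (model data)."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

Q, actions = {}, [0, 1]
q_update(Q, s=0, a=1, r=1.0, s_next=2, actions=actions)  # learning: transition observed in the environment
q_update(Q, s=0, a=1, r=1.0, s_next=2, actions=actions)  # planning: same rule applied to a model-generated (r, s_next)
```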
The Effect of Planning Steps
Sutton's original experiments on a simple gridworld maze showed dramatic results. With $n = 50$ planning steps per real step, the agent solved the maze in roughly 3 episodes. With $n = 0$ (pure Q-learning), it required approximately 25 episodes -- an order of magnitude more real experience. The computational cost shifts from environment interactions (expensive) to model queries (cheap).
Dyna with Prioritized Sweeping
Random sampling of states for planning is wasteful. Prioritized sweeping (Moore & Atkeson, 1993) focuses planning on states whose values are most likely to change. When a state's value changes significantly, its predecessors (states that transition into it) are added to a priority queue ranked by the magnitude of their expected value change: $P = \left| r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right|$
Planning updates are performed in priority order, which dramatically accelerates convergence. In Sutton and Barto's maze experiments, prioritized sweeping finds optimal solutions roughly 5 to 10 times faster than Dyna-Q with uniform random sampling.
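A sketch of the prioritized planning loop follows. The data structures (`Q`, `model`, `predecessors`) and the threshold and defaults are illustrative assumptions, not the exact bookkeeping from Moore & Atkeson:

```python
import heapq
import itertools

def prioritized_sweeping_step(Q, model, predecessors, actions, s, a,
                              alpha=0.1, gamma=0.95, theta=1e-4, n_planning=10):
    """Prioritized planning after a real transition from (s, a).

    Q:            dict mapping (state, action) -> value (missing entries treated as 0.0)
    model:        dict mapping (state, action) -> (reward, next_state)
    predecessors: dict mapping state -> set of (state, action) pairs leading into it
    """
    counter = itertools.count()   # tie-breaker so the heap never compares states directly
    pq = []                       # max-priority queue via negated priorities

    def priority(state, action):
        r, s_next = model[(state, action)]
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        return abs(r + gamma * best_next - Q.get((state, action), 0.0))

    def maybe_push(state, action):
        p = priority(state, action)
        if p > theta:
            heapq.heappush(pq, (-p, next(counter), (state, action)))

    maybe_push(s, a)
    for _ in range(n_planning):
        if not pq:
            break
        _, _, (ps, pa) = heapq.heappop(pq)
        pr, ps_next = model[(ps, pa)]
        best_next = max(Q.get((ps_next, b), 0.0) for b in actions)
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * best_next - Q.get((ps, pa), 0.0))
        # A change at ps can change the targets of its predecessors: queue them too.
        for (pred_s, pred_a) in predecessors.get(ps, ()):
            maybe_push(pred_s, pred_a)
```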
When the Model Is Wrong
Dyna's critical vulnerability is model inaccuracy. If the environment changes (a shortcut opens, a wall appears), the model reflects the old environment, and planning propagates stale information. Sutton introduced Dyna-Q+ to address this: each state-action pair tracks how long it has been since it was last tried in the real environment, and the model adds a bonus reward $\kappa\sqrt{\tau}$ (where $\tau$ is the time since the last real visit and $\kappa$ is a small constant) to encourage re-exploration of states that may have changed. This transforms model staleness into exploration incentive.
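A minimal sketch of the bonus used during planning (the function name, signature, and the value of $\kappa$ are assumptions for illustration):

```python
import math

def dyna_q_plus_reward(r_model, last_real_visit_step, current_step, kappa=1e-3):
    """Planning reward used by Dyna-Q+: the model's stored reward plus a bonus
    kappa * sqrt(tau), where tau is the time since (s, a) was last tried for real."""
    tau = current_step - last_real_visit_step
    return r_model + kappa * math.sqrt(tau)
```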
Why It Matters
Dyna established the conceptual vocabulary for all subsequent model-based RL. The idea that a learned model can generate synthetic training data -- effectively multiplying the value of real experience -- is the foundation of modern approaches from Dreamer (see world-models.md) to MBPO (see planning-with-learned-models.md). Any system that uses a learned simulator to augment real data is, at its core, a descendant of Dyna.
Key Technical Details
- The planning parameter $n$ is a hyperparameter trading computation for sample efficiency. Values of $n$ from 5 to 50 are typical in tabular settings.
- Tabular Dyna-Q assumes a deterministic model. For stochastic environments, the model must store distributions or sample from stored transitions (see the sketch after this list).
- Each planning step has $O(1)$ cost in tabular settings (table lookup + Q-update), making the per-step computational overhead minimal.
- Dyna's model is a sample model (generates transitions) rather than a distribution model (returns full probability distributions). Sample models are sufficient for simulation-based planning.
- In deep RL, the Dyna principle manifests as training neural network dynamics models and generating rollouts to train a policy (as in MBPO, where the "planning steps" are short neural-network rollouts added to a replay buffer).
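For the stochastic-environment point above, one simple option is an empirical sample model that remembers every observed outcome of each $(s, a)$ pair and samples among them. This class and its method names are hypothetical, shown only to illustrate the idea:

```python
import random
from collections import defaultdict

class StochasticSampleModel:
    """Sample model for stochastic environments: remember every observed outcome
    of (s, a) and sample among them in proportion to how often each occurred."""

    def __init__(self):
        self.outcomes = defaultdict(list)   # (s, a) -> list of observed (r, s')

    def update(self, s, a, r, s_next):
        self.outcomes[(s, a)].append((r, s_next))

    def sample(self, s, a):
        # Sampling from stored outcomes approximates the true transition distribution
        return random.choice(self.outcomes[(s, a)])

    def visited_pairs(self):
        return list(self.outcomes.keys())
```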
Common Misconceptions
"Dyna is just Q-learning with a replay buffer." Experience replay stores and replays real transitions. Dyna generates new, synthetic transitions from a learned model. A replayed transition is always faithful to what actually happened; a Dyna transition reflects the model's (possibly inaccurate) understanding. These are fundamentally different data sources.
"More planning steps are always better." If the model is inaccurate, excessive planning propagates errors and can degrade performance. The optimal depends on model quality: more accurate models support more planning.
"Dyna requires a tabular setting." The original Dyna-Q is tabular, but the Dyna principle -- interleaving real and simulated experience -- is architecture-agnostic. Deep Dyna-Q, MBPO, and Dreamer all implement the Dyna idea with neural network function approximation and models.
"The model must predict raw observations." In modern Dyna-inspired systems, the model often operates in a learned latent space, predicting compressed representations rather than raw pixels (see world-models.md, muzero.md).
Connections to Other Concepts
- model-based-vs-model-free.md -- Dyna as the bridge between model-based and model-free RL.
- world-models.md -- Modern evolution of Dyna using latent-space dynamics models and imagination-based training.
- planning-with-learned-models.md -- Neural network dynamics models for trajectory optimization, extending Dyna's planning idea.
- muzero.md -- Planning in a learned latent space with MCTS, a sophisticated descendant of Dyna's simulate-then-update loop.
- ../02-tabular-methods/q-learning.md -- The base algorithm that Dyna-Q augments with model-based planning.
Further Reading
- Sutton (1991) -- "Dyna, an Integrated Architecture for Learning, Planning, and Reacting." SIGART Bulletin. The original Dyna paper establishing the integrated architecture.
- Sutton & Barto (2018) -- Reinforcement Learning: An Introduction, Chapter 8. The most accessible treatment of Dyna, including Dyna-Q, Dyna-Q+, and prioritized sweeping.
- Moore & Atkeson (1993) -- "Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time." Machine Learning. Introduces prioritized sweeping for dramatically faster planning convergence.
- Gu et al. (2016) -- "Continuous Deep Q-Learning with Model-based Acceleration." Extends Dyna-style model-based acceleration to continuous control with deep networks.
- Janner et al. (2019) -- "When to Trust Your Model: Model-Based Policy Optimization." The modern deep RL heir of Dyna, using short model rollouts to augment replay buffers.