One-Line Summary: Learning compressed latent representations of environment dynamics so an agent can "dream" -- planning and even training entirely within an imagined version of the world.

Prerequisites: model-based-vs-model-free.md, dyna-architecture.md, ../01-foundations/markov-decision-processes.md, basic familiarity with VAEs and RNNs

What Is a World Model?

Imagine you are planning a road trip. You do not need to physically drive every possible route to decide which one is best. Instead, you have an internal mental model of geography, traffic patterns, and road conditions. You simulate routes in your head -- "If I take I-95, there will be traffic near DC; if I take the coastal route, it will be longer but scenic." You are imagining the consequences of actions without actually experiencing them.

A world model in RL is a learned neural network that serves the same function: it captures the dynamics of the environment in a compressed, abstract form and allows the agent to simulate future trajectories in its imagination. The agent can then train a policy on these imagined trajectories, dramatically reducing the need for real environment interaction. This is sometimes called "learning to dream."

The critical innovation over earlier model-based approaches is that world models learn in a latent space -- a compact, learned representation -- rather than trying to predict raw high-dimensional observations like pixels. This makes them scalable to visually complex environments.

How It Works

Ha & Schmidhuber (2018): The Original World Models

The seminal "World Models" paper proposed a three-component architecture:

1. Vision Model (VAE): A variational autoencoder compresses each observation (e.g., a 64x64 image) into a low-dimensional latent vector $z_t$:

   $$ z_t \sim q(z_t \mid o_t) $$

2. Memory Model (MDN-RNN): A recurrent neural network with a Mixture Density Network output predicts the distribution over next latent states given the current latent state, action, and hidden state:

   $$ P(z_{t+1} \mid z_t, a_t, h_t) $$

where $h_t$ is the RNN hidden state encoding history, and the MDN models the output as a mixture of Gaussians.

3. Controller: A simple linear policy maps the concatenated latent state and hidden memory to actions, $a_t = W_c\,[z_t; h_t] + b_c$. Crucially, this controller can be trained entirely inside the dream -- using trajectories generated by the VAE and MDN-RNN, never requiring real environment interaction during policy optimization. A minimal sketch of all three components follows this list.
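The sketch below shows the shape of the three components in PyTorch. It is illustrative only: the paper uses a convolutional encoder/decoder where linear layers stand in here, and the class and argument names are our own; only the latent size (32) and hidden size (256) follow the paper.

```python
# Illustrative sketch of the three World Models components (Ha & Schmidhuber, 2018).
# Linear layers stand in for the paper's conv encoder/decoder; names are assumptions.
import torch
import torch.nn as nn

class VisionVAE(nn.Module):
    """V: compresses an observation o_t into a latent vector z_t."""
    def __init__(self, obs_dim=64 * 64 * 3, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(obs_dim, 2 * z_dim)
        self.dec = nn.Linear(z_dim, obs_dim)

    def encode(self, obs):
        mu, log_sigma = self.enc(obs.flatten(1)).chunk(2, dim=-1)
        return mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterized z_t

class MDNRNN(nn.Module):
    """M: predicts a mixture-of-Gaussians distribution over z_{t+1}."""
    def __init__(self, z_dim=32, a_dim=3, h_dim=256, n_mix=5):
        super().__init__()
        self.rnn = nn.LSTMCell(z_dim + a_dim, h_dim)
        # Per mixture component: one logit, plus a mean and log-scale per latent dim.
        self.mdn = nn.Linear(h_dim, n_mix * (2 * z_dim + 1))

    def step(self, z, a, hidden):
        h, c = self.rnn(torch.cat([z, a], dim=-1), hidden)
        return self.mdn(h), (h, c)  # raw MDN parameters and new hidden state

class Controller(nn.Module):
    """C: a_t = tanh(W [z_t; h_t] + b) -- a single linear layer."""
    def __init__(self, z_dim=32, h_dim=256, a_dim=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + h_dim, a_dim)

    def forward(self, z, h):
        return torch.tanh(self.fc(torch.cat([z, h], dim=-1)))
```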

The original paper demonstrated a car-racing agent (CarRacing-v0) that drives using the learned latent features, and a VizDoom agent (DoomTakeCover) whose controller was trained entirely within its dream after an initial phase of random data collection.

Dreamer and the RSSM Architecture

Hafner et al. (2020, 2021, 2023) developed the Dreamer family (DreamerV1, V2, V3), which substantially advanced world models using the Recurrent State-Space Model (RSSM).

The RSSM maintains two types of latent state at each step:

Deterministic state $h_t$: An RNN hidden state capturing long-term temporal dependencies:

$$ h_t = f(h_{t-1}, z_{t-1}, a_{t-1}) $$

Stochastic state $z_t$: A categorical or Gaussian latent variable capturing uncertainty:

$$ \text{prior: } \hat{z}_t \sim p(\hat{z}_t \mid h_t), \qquad \text{posterior: } z_t \sim q(z_t \mid h_t, o_t) $$

The prior $p(\hat{z}_t \mid h_t)$ predicts the stochastic state from the deterministic state alone (no observation), enabling purely imagined rollouts. The posterior $q(z_t \mid h_t, o_t)$ incorporates the actual observation and is used during model training. The model is trained to minimize:

$$ \mathcal{L} = -\ln p(o_t \mid h_t, z_t) + \beta \, \mathrm{KL}\big[\, q(z_t \mid h_t, o_t) \,\|\, p(\hat{z}_t \mid h_t) \,\big] $$

The first term is the reconstruction loss; the second (KL divergence) regularizes the posterior toward the prior, ensuring the prior is useful for imagination. In practice, reward and continuation predictors are trained on the same latent states, which is what gives imagined trajectories their reward estimates.
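Below is a hedged sketch of one RSSM step and the loss above, assuming diagonal-Gaussian latents and single linear layers for every head; the sizes and names (`Z_DIM`, `H_DIM`, `OBS_DIM`, ...) are illustrative assumptions, not Dreamer's actual architecture.

```python
# Minimal RSSM sketch: deterministic GRU path plus prior/posterior Gaussian heads.
import torch
import torch.nn as nn
import torch.distributions as D

Z_DIM, H_DIM, A_DIM, OBS_DIM = 32, 256, 4, 1024  # illustrative sizes

class RSSM(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRUCell(Z_DIM + A_DIM, H_DIM)            # deterministic path
        self.prior_net = nn.Linear(H_DIM, 2 * Z_DIM)           # p(z_t | h_t)
        self.post_net = nn.Linear(H_DIM + OBS_DIM, 2 * Z_DIM)  # q(z_t | h_t, o_t)
        self.decoder = nn.Linear(H_DIM + Z_DIM, OBS_DIM)       # reconstruction head

    def _dist(self, stats):
        mu, log_sigma = stats.chunk(2, dim=-1)
        return D.Independent(D.Normal(mu, log_sigma.exp()), 1)

    def step(self, h, z, a, obs=None):
        """One latent step: posterior if obs is given, prior otherwise."""
        h = self.gru(torch.cat([z, a], dim=-1), h)  # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        prior = self._dist(self.prior_net(h))
        if obs is None:
            return h, prior.rsample(), prior, None  # imagination: prior only
        post = self._dist(self.post_net(torch.cat([h, obs], dim=-1)))
        return h, post.rsample(), prior, post

def model_loss(rssm, h, z, prior, post, obs, beta=1.0):
    """Reconstruction term plus beta-weighted KL(posterior || prior)."""
    recon = D.Independent(D.Normal(rssm.decoder(torch.cat([h, z], dim=-1)), 1.0), 1)
    return (-recon.log_prob(obs) + beta * D.kl_divergence(post, prior)).mean()
```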

Dreaming: Imagination-Based Policy Training

Once the world model is trained, Dreamer trains an actor-critic policy entirely within imagined latent trajectories:

  1. Start from a real observation encoded into a latent state $(h_t, z_t)$.
  2. Roll out $H$ steps (typically $H = 15$) using only the prior and the policy: $a_t \sim \pi(a_t \mid h_t, z_t)$, $h_{t+1} = f(h_t, z_t, a_t)$, $\hat{z}_{t+1} \sim p(\hat{z}_{t+1} \mid h_{t+1})$.
  3. Predict rewards and values along the imagined trajectory.
  4. Backpropagate through the entire imagined trajectory to update the actor, using reparameterized pathwise gradients (DreamerV1) or Reinforce-style gradients with straight-through estimation for the discrete latents (DreamerV2/V3); see the sketch after this list.
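A minimal sketch of this loop, reusing the hypothetical RSSM class from the sketch above. Here `actor`, `critic`, and `reward_head` are assumed modules rather than Dreamer's exact heads, and a plain discounted return stands in for Dreamer's lambda-returns:

```python
# Hedged sketch of Dreamer-style imagination: roll the policy forward in latent
# space with the prior only, then score the imagined return.
import torch

def imagined_actor_loss(rssm, actor, critic, reward_head, h, z, horizon=15, gamma=0.99):
    feats, rewards = [], []
    for _ in range(horizon):
        a = actor(torch.cat([h, z], dim=-1))  # action from the imagined state
        h, z, _, _ = rssm.step(h, z, a)       # prior-only step: no observation
        feats.append(torch.cat([h, z], dim=-1))
        rewards.append(reward_head(feats[-1]))

    # Bootstrap from the critic at the last imagined state, then accumulate
    # discounted imagined rewards backwards through the trajectory.
    ret = critic(feats[-1])
    for r in reversed(rewards[:-1]):
        ret = r + gamma * ret
    return -ret.mean()  # pathwise gradients flow through the whole rollout
```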

DreamerV3: Scaling Across Domains

DreamerV3 (Hafner et al., 2023) combined several ingredients for generality: discrete latent states using categorical distributions (32 distributions with 32 classes each, carried over from DreamerV2), symlog predictions for reward and value normalization, and an emphasis on hyperparameter robustness. A single set of hyperparameters achieved strong performance across 150+ tasks spanning continuous control, Atari, DMLab, Minecraft, and Crafter -- with no task-specific tuning. Notably, DreamerV3 was the first algorithm to collect diamonds in Minecraft from scratch without human demonstrations or curricula.
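The symlog transform itself is a one-liner. Below is a direct transcription of its definition from the DreamerV3 paper, together with its inverse; the PyTorch framing is ours:

```python
# symlog squashing used by DreamerV3 for reward/value targets, and its inverse.
import torch

def symlog(x: torch.Tensor) -> torch.Tensor:
    """sign(x) * ln(1 + |x|): compresses large magnitudes symmetrically."""
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x: torch.Tensor) -> torch.Tensor:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return torch.sign(x) * torch.expm1(torch.abs(x))
```

Predicting targets in symlog space keeps losses on a similar scale across domains whose raw rewards differ by orders of magnitude.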

Why It Matters

World models represent a path toward sample-efficient, general-purpose RL agents. By decoupling representation learning (the world model) from behavioral optimization (the policy), they enable agents to learn rich dynamics from limited data and then cheaply generate unlimited imagined experience for training. This mirrors a hypothesis about biological intelligence: that brains maintain internal models of the world and "simulate" future scenarios during sleep and idle states.

Key Technical Details

  • The original World Models paper used a latent space of dimension 32 for the VAE and a hidden state of 256 for the MDN-RNN.
  • DreamerV2 uses categorical latents: 32 distributions, each over 32 classes, yielding a discrete latent space of $32^{32} \approx 10^{48}$ possible states. Discrete latents outperform Gaussian latents in practice.
  • Imagination horizons in Dreamer are typically $H = 15$ steps, balancing model accuracy against planning depth.
  • DreamerV3 on the Atari 100k benchmark achieves superhuman performance on many games with only 400k environment frames (100k agent steps, roughly 2 hours of real-time play), compared to the 200M frames used to train DQN.
  • RSSM training requires both the reconstruction loss and the KL loss. The KL term is often split into two stop-gradient directions with separate weights -- one training the prior toward the posterior, the other regularizing the posterior toward the prior (KL balancing; see the sketch after this list).
  • World models can be trained on offline datasets, enabling a form of offline model-based RL.
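A minimal sketch of the KL balancing mentioned above, assuming diagonal-Gaussian latents; the weight alpha = 0.8 follows DreamerV2, while the helper `_detached` is our own:

```python
# Hedged sketch of KL balancing (DreamerV2-style) for diagonal-Gaussian latents.
import torch.distributions as D

def _detached(dist: D.Independent) -> D.Independent:
    """Copy of a diagonal Gaussian with gradients stopped."""
    base = dist.base_dist
    return D.Independent(D.Normal(base.loc.detach(), base.scale.detach()), 1)

def balanced_kl(post: D.Independent, prior: D.Independent, alpha: float = 0.8):
    # Weight alpha trains the prior toward a frozen posterior;
    # weight 1 - alpha regularizes the posterior toward a frozen prior.
    prior_term = D.kl_divergence(_detached(post), prior)
    post_term = D.kl_divergence(post, _detached(prior))
    return alpha * prior_term + (1 - alpha) * post_term
```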

Common Misconceptions

"World models must reconstruct pixel observations." Reconstruction is a training signal, not the goal. What matters is that the latent dynamics are accurate. DreamerV3 uses symlog-scaled reconstruction precisely because pixel-perfect fidelity is unnecessary. MuZero (see muzero.md) dispenses with reconstruction entirely.

"Training in imagination introduces unrecoverable model bias." While imagined trajectories are imperfect, the policy is regularly re-grounded by encoding real observations. The interplay between real data (which updates the model) and imagined data (which updates the policy) creates a self-correcting loop.

"World models are only for visual environments." World models work with any observation modality. They are most advantageous when observations are high-dimensional and compressible (images, point clouds), but the latent dynamics framework applies equally to proprioceptive or structured state spaces.

"The 'dreaming' metaphor is just marketing." The parallel to biological dreaming is substantive. Neuroscience research suggests that hippocampal replay during sleep serves a similar function: re-simulating experiences to consolidate learning (Diba & Buzsaki, 2007).

Connections to Other Concepts

  • dyna-architecture.md -- World models are the modern deep learning realization of Dyna's model-based planning.
  • model-based-vs-model-free.md -- World models exemplify the sample-efficiency advantages of model-based approaches.
  • muzero.md -- Plans in latent space like Dreamer but uses MCTS instead of actor-critic and discards reconstruction.
  • planning-with-learned-models.md -- Alternative approaches to planning with learned dynamics (MPC, shooting methods).
  • monte-carlo-tree-search.md -- Tree-based planning that can operate on top of a world model.

Further Reading

  • Ha & Schmidhuber (2018) -- "World Models." NeurIPS. The original paper introducing VAE + MDN-RNN for learning to act in dreams.
  • Hafner et al. (2020) -- "Dream to Control: Learning Behaviors by Latent Imagination." ICLR. DreamerV1: actor-critic training in RSSM-based imagined trajectories.
  • Hafner et al. (2021) -- "Mastering Atari with Discrete World Models." ICLR. DreamerV2: categorical latents and KL balancing for Atari-scale performance.
  • Hafner et al. (2023) -- "Mastering Diverse Domains through World Models." arXiv. DreamerV3: a single algorithm with fixed hyperparameters across 150+ tasks including Minecraft diamond collection.
  • Matsuo et al. (2022) -- "Deep Learning, Reinforcement Learning, and World Models." Neural Networks. A survey connecting world models to broader themes in AI and cognitive science.