One-Line Summary: The mathematical framework formalizing sequential decision-making with states, actions, transition probabilities, and rewards.
Prerequisites: Basic probability theory, conditional probability, what-is-reinforcement-learning.md.
What Is a Markov Decision Process?
Imagine you are navigating a city with a GPS. At every intersection (state), you choose a road (action). The traffic conditions (transition probabilities) determine where you actually end up, and you incur a travel time cost (reward). Your GPS does not care about how you arrived at the current intersection -- only where you are now matters for deciding the best next turn. That "only the present matters" property is the Markov property, and the entire setup is a Markov Decision Process (MDP).
An MDP provides the precise mathematical language for describing any problem where an agent makes sequential decisions under uncertainty. It is to reinforcement learning what the loss function is to supervised learning: the formal object that defines the problem.
How It Works
The MDP Tuple
An MDP is defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$ -- A set of states (finite or infinite). See states-actions-rewards.md.
- $\mathcal{A}$ -- A set of actions available to the agent (may depend on state: $\mathcal{A}(s)$).
- $P$ -- Transition dynamics: $P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$.
- $R$ -- Reward function: $R(s, a)$ or sometimes $R(s, a, s')$. The expected reward is $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.
- $\gamma$ -- Discount factor: $\gamma \in [0, 1]$. See return-and-discount-factor.md.
The transition function satisfies:
$$\sum_{s' \in \mathcal{S}} P(s' \mid s, a) = 1 \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s).$$
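To make the tuple concrete, here is a minimal sketch of a small finite MDP written out as plain Python data. The two-state battery example, the state and action names, and the reward numbers are illustrative choices, not taken from the text above; the loop at the end checks the normalization constraint just stated.

```python
# A minimal sketch of a finite MDP as plain Python data (toy example).

S = ["low_battery", "high_battery"]          # state set S
A = ["search", "recharge"]                   # action set A
gamma = 0.9                                  # discount factor

# P[(s, a)] maps next states to probabilities: P(s' | s, a)
P = {
    ("high_battery", "search"):   {"high_battery": 0.7, "low_battery": 0.3},
    ("high_battery", "recharge"): {"high_battery": 1.0},
    ("low_battery", "search"):    {"high_battery": 0.1, "low_battery": 0.9},
    ("low_battery", "recharge"):  {"high_battery": 1.0},
}

# R[(s, a)] is the expected immediate reward r(s, a)
R = {
    ("high_battery", "search"): 1.0,
    ("high_battery", "recharge"): 0.0,
    ("low_battery", "search"): -1.0,
    ("low_battery", "recharge"): 0.0,
}

# Each conditional distribution must sum to 1 (the constraint above).
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```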
The Markov Property
The defining feature of an MDP is the Markov property (also called "memorylessness"):
$$\Pr(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0) = \Pr(S_{t+1} = s' \mid S_t, A_t).$$
The future depends on the past only through the present state. This is not a restriction on the environment -- it is a requirement on the state representation. If the state captures all relevant information, the Markov property holds by construction.
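A common way to satisfy this requirement in practice is to augment the state with recent history. The sketch below illustrates the frame-stacking idea discussed later under Key Technical Details: keep the last $k$ observations in a buffer and treat the stacked tuple as the state. The function names and $k = 4$ are illustrative, not a prescribed API.

```python
from collections import deque

# Sketch: if a single observation is not Markov (e.g., one image frame carries
# no velocity information), stacking the last k observations into the state
# often restores an (approximately) Markov state representation.

k = 4
frames = deque(maxlen=k)

def reset_state(first_obs):
    """Initialize the stack by repeating the first observation k times."""
    frames.clear()
    for _ in range(k):
        frames.append(first_obs)
    return tuple(frames)

def next_state(obs):
    """Append the newest observation; the tuple of the last k is the state."""
    frames.append(obs)
    return tuple(frames)
```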
Finite vs. Infinite MDPs
Finite MDPs have finite state and action spaces ($|\mathcal{S}| < \infty$, $|\mathcal{A}| < \infty$). The transition dynamics can be stored as a $|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|$ tensor. Most theoretical results in RL are proven for finite MDPs.
Continuous MDPs have continuous state spaces (e.g., $\mathcal{S} \subseteq \mathbb{R}^n$), continuous action spaces (e.g., $\mathcal{A} \subseteq \mathbb{R}^m$), or both. Robotics problems typically involve continuous MDPs. The transition function becomes a conditional density: $p(s' \mid s, a)$.
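A short sketch of the two representations, assuming the finite case is stored as a NumPy tensor and the continuous case as a one-dimensional linear-Gaussian density; the sizes, random seed, and noise scale are illustrative.

```python
import numpy as np

# Finite MDP: transition dynamics as a |S| x |A| x |S| tensor (toy sizes).
n_states, n_actions = 5, 3
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)   # normalize so each P[s, a, :] sums to 1

# Continuous MDP: the table is replaced by a conditional density p(s' | s, a).
# Illustrative 1-D linear-Gaussian transition: s' ~ N(s + a, sigma^2).
def transition_density(s_next, s, a, sigma=0.1):
    return np.exp(-(s_next - (s + a)) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
```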
Episodic vs. Continuing Tasks
Episodic tasks have a natural terminal state. The interaction breaks into episodes, each starting from an initial state and ending at termination. Examples: a game of chess, navigating a maze, a dialogue turn.
The return for an episode of length $T$ is:
$$G_t = R_{t+1} + R_{t+2} + \dots + R_T.$$
Continuing tasks have no terminal state and run indefinitely. Examples: process control, server load balancing, a robot maintaining balance. Here, discounting ($\gamma < 1$) is essential to keep the return finite:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$
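A quick sketch of both return computations on a hypothetical reward sequence; the reward values and $\gamma = 0.9$ are illustrative.

```python
# Computing the return from a reward sequence R_{t+1}, R_{t+2}, ... (toy values).
rewards = [1.0, 0.0, -1.0, 2.0]

# Episodic (undiscounted) return: plain sum up to the terminal step T.
episodic_return = sum(rewards)

# Discounted return: sum_k gamma^k * R_{t+k+1}; gamma < 1 keeps the sum finite
# even for an infinite reward stream.
gamma = 0.9
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
```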
Sutton & Barto (2018) unify both cases using an absorbing state that transitions only to itself with zero reward, allowing episodic tasks to be treated as a special case of continuing tasks.
The Dynamics Function
The full dynamics of an MDP are captured by the four-argument function:
$$p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).$$
From this, we can derive:
- Transition probabilities: $P(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$
- Expected reward: $r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$
- Expected reward given transition: $r(s, a, s') = \sum_{r} r \, \dfrac{p(s', r \mid s, a)}{P(s' \mid s, a)}$
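These derivations translate directly into code. Below is a minimal sketch assuming the joint dynamics are stored as a dictionary from $(s, a)$ pairs to a distribution over $(s', r)$ pairs; the states, actions, and probabilities are illustrative.

```python
# Recovering P(s'|s,a) and r(s,a) from the joint dynamics p(s', r | s, a).
# dynamics[(s, a)] maps (s', r) pairs to probabilities (toy numbers).
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.2, ("s1", 1.0): 0.8},
    ("s0", "a1"): {("s0", 0.0): 1.0},
}

def transition_prob(s, a, s_next):
    """P(s'|s,a) = sum over r of p(s', r | s, a)."""
    return sum(p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s,a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items())

def expected_reward_given_transition(s, a, s_next):
    """r(s,a,s') = sum_r r * p(s', r | s, a) / P(s'|s,a)."""
    denom = transition_prob(s, a, s_next)
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next) / denom
```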
Extensions Beyond Standard MDPs
| Extension | What Changes | Example |
|---|---|---|
| POMDP | Agent observes $o$ instead of $s$ | Robot with noisy sensors |
| Semi-MDP | Actions take variable time | Options framework |
| Dec-POMDP | Multiple agents, partial observability | Multi-robot coordination |
| Constrained MDP | Additional cost constraints | Safe RL |
| Factored MDP | State is a product of variables | Large structured domains |
Why It Matters
Every RL algorithm -- from tabular Q-learning to PPO to RLHF -- implicitly or explicitly assumes an underlying MDP (or extension thereof). Understanding MDPs is essential for: (1) correctly formulating a real-world problem as an RL problem, (2) understanding what theoretical guarantees apply, and (3) diagnosing failures when the Markov assumption is violated.
Key Technical Details
- A finite MDP with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions requires $|\mathcal{S}|^2 |\mathcal{A}|$ parameters to specify the transition function -- e.g., 10 states and 4 actions already give $10 \times 4 \times 10 = 400$ transition entries.
- The Markov property is about the state representation, not the environment. History-dependent environments can still be Markovian if the state is augmented (e.g., using frame stacking in Atari, where DQN concatenates the last 4 frames).
- Ergodic MDPs -- where every state is reachable from every other state under every policy -- have a unique stationary state distribution for each policy, independent of the starting state, which makes long-run (average-reward) quantities well-defined for all policies.
- When the transition function and reward function are known, the MDP can be solved exactly via dynamic programming. When they are unknown, we enter the realm of RL.
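To illustrate the last point, here is a minimal value-iteration sketch for the known-dynamics case, assuming a transition tensor $P$ of shape $(|\mathcal{S}|, |\mathcal{A}|, |\mathcal{S}|)$ and an expected-reward matrix $R$ of shape $(|\mathcal{S}|, |\mathcal{A}|)$ as above; the default discount and the convergence tolerance are illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve a finite MDP with known dynamics by value iteration.

    P: transition tensor of shape (|S|, |A|, |S|), rows summing to 1.
    R: expected-reward matrix of shape (|S|, |A|).
    Returns the optimal state values and a greedy (optimal) policy.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```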
Common Misconceptions
"The Markov property means the environment has no memory." The environment can have arbitrarily complex dynamics. The Markov property states that the state representation captures all information needed for prediction. If it does not, you need a richer state (or a POMDP formulation).
"Real-world problems are never truly Markov." True in theory, but in practice, a sufficiently rich state representation makes the Markov assumption a reasonable approximation. Deep RL uses neural networks to learn representations that are approximately Markov.
"MDPs require discrete states and actions." MDPs generalize naturally to continuous spaces. Continuous-state MDPs are standard in control theory and robotics.
Connections to Other Concepts
- what-is-reinforcement-learning.md -- MDPs formalize the RL problem.
- states-actions-rewards.md -- Detailed treatment of the components of the MDP tuple.
- return-and-discount-factor.md -- The objective function defined over MDPs.
- policies.md -- Solutions to MDPs are expressed as policies.
- bellman-equations.md -- The recursive equations that characterize value functions in MDPs.
- value-functions.md -- How to evaluate policies within the MDP framework.
Further Reading
- Puterman (1994) -- Markov Decision Processes: Discrete Stochastic Dynamic Programming. The authoritative mathematical treatment of MDPs covering existence of optimal policies, solution methods, and structural results.
- Sutton & Barto (2018) -- Reinforcement Learning: An Introduction, Chapters 3-4. Accessible introduction to the MDP formalism and dynamic programming solutions.
- Bellman (1957) -- Dynamic Programming. The foundational work introducing the principle of optimality and the Bellman equation.
- Kaelbling, Littman & Cassandra (1998) -- "Planning and acting in partially observable stochastic domains." Artificial Intelligence, 101. Definitive treatment of POMDPs.
- Bertsekas & Tsitsiklis (1996) -- Neuro-Dynamic Programming. Bridges MDPs with approximate/neurally parameterized solutions.