Markov Chain Monte Carlo

One-Line Summary: Sampling from complex posterior distributions by constructing Markov chains -- when exact inference is intractable.

Prerequisites: Bayesian inference, probability distributions, Markov chains, Monte Carlo integration.

What Is Markov Chain Monte Carlo?

Imagine you need to explore a vast, mountainous landscape to understand the distribution of altitude. You cannot see the whole terrain at once, and you cannot compute the average altitude analytically. Instead, you drop a hiker at a random point and let them wander: at each step they propose moving to a nearby location and accept the move based on certain rules. Over time, the hiker spends more time at higher-altitude regions (or lower, depending on the objective). By recording the hiker's path, you build up a picture of the entire landscape. This is the core idea of MCMC.

In Bayesian inference, we need to compute expectations under the posterior distribution:

$E_{p (θ ∣ D)} [f (θ)] = \int f (θ) p (θ ∣ D) d θ$

When the posterior $p (θ ∣ D)$ is high-dimensional or has no closed form, this integral is intractable. MCMC constructs a Markov chain whose stationary distribution is exactly $p (θ ∣ D)$ . After the chain converges, samples $θ^{(1)}, θ^{(2)}, \dots, θ^{(T)}$ approximate the posterior, and expectations are estimated as:

$E [f (θ)] \approx \frac{1}{T} \sum_{t = 1}^{T} f (θ^{(t)})$

How It Works

Markov Chains and Stationary Distributions

A Markov chain is a sequence of random variables $θ^{(0)}, θ^{(1)}, \dots$ where the distribution of $θ^{(t + 1)}$ depends only on $θ^{(t)}$ , not on earlier states. A chain has a stationary distribution $π (θ)$ if, once the chain's state is distributed as $π$ , all subsequent states remain distributed as $π$ .

For MCMC, we design a transition kernel $T (θ^{'} ∣ θ)$ such that the stationary distribution is our target posterior $p (θ ∣ D)$ . A sufficient condition is detailed balance:

$p (θ ∣ D) T (θ^{'} ∣ θ) = p (θ^{'} ∣ D) T (θ ∣ θ^{'})$

The Metropolis-Hastings Algorithm

Metropolis-Hastings (MH) is the foundational MCMC algorithm. Given current state $θ^{(t)}$ :

Propose a candidate $θ^{'}$ from a proposal distribution $q (θ^{'} ∣ θ^{(t)})$ .
Compute the acceptance probability:

$α = min (1, \frac{p ( θ ^{'} ∣ D ) q ( θ ^{(t)} ∣ θ ^{'} )}{p ( θ ^{(t)} ∣ D ) q ( θ ^{'} ∣ θ ^{(t)} )})$

Accept $θ^{'}$ with probability $α$ (set $θ^{(t + 1)} = θ^{'}$ ); otherwise stay ( $θ^{(t + 1)} = θ^{(t)}$ ).

A key advantage: only the ratio $p (θ^{'} ∣ D) / p (θ^{(t)} ∣ D)$ is needed, so the intractable normalizing constant $P (D)$ cancels. We only need to evaluate the unnormalized posterior $p (D ∣ θ) p (θ)$ .

The proposal $q$ is a design choice. A common option is a symmetric random walk: $q (θ^{'} ∣ θ) = N (θ^{'} ∣ θ, σ^{2} I)$ . If $σ$ is too small, the chain takes tiny steps and explores slowly. If $σ$ is too large, most proposals land in low-probability regions and are rejected. Tuning $σ$ to achieve an acceptance rate of roughly 23% (in high dimensions) is a standard heuristic.

Gibbs Sampling

Gibbs sampling is a special case of MH where each variable is sampled from its full conditional distribution while holding all other variables fixed:

$θ_{j}^{(t + 1)} \sim p (θ_{j} ∣ θ_{1}^{(t + 1)}, \dots, θ_{j - 1}^{(t + 1)}, θ_{j + 1}^{(t)}, \dots, θ_{d}^{(t)})$

Every proposal is accepted ( $α = 1$ ). Gibbs sampling is efficient when full conditionals are easy to sample from, which is common in conjugate models and graphical models (each variable's full conditional depends only on its Markov blanket).

However, Gibbs sampling can be slow when variables are strongly correlated, because it can only move along coordinate axes and has difficulty traversing narrow diagonal ridges in the posterior.

Burn-In and Thinning

Burn-in: The initial samples before the chain has converged to the stationary distribution are discarded. These early samples are influenced by the arbitrary initialization and do not represent the target posterior.

Thinning: Because consecutive MCMC samples are correlated (they form a Markov chain), practitioners sometimes keep only every $k$ -th sample to reduce autocorrelation. However, thinning discards information; it is often better to keep all post-burn-in samples and account for autocorrelation when computing standard errors.

Convergence Diagnostics

Since we never know for certain whether a chain has converged, several diagnostics help assess convergence:

Trace plots: Visual inspection of parameter values over iterations. A converged chain should look like a "fuzzy caterpillar" with no trends or drifts.
$\hat{R}$ (R-hat, Gelman-Rubin statistic): Compares within-chain and between-chain variance across multiple independent chains. Values near 1.0 (typically $\hat{R} < 1.01$ ) suggest convergence.
Effective Sample Size (ESS): Estimates the number of independent samples accounting for autocorrelation. An ESS of at least 400 per parameter is a common recommendation for reliable posterior summaries.

Hamiltonian Monte Carlo (HMC)

HMC exploits gradient information to make large, informed moves through parameter space. It treats the negative log-posterior as a "potential energy" and introduces auxiliary "momentum" variables $r$ :

$H (θ, r) = - lo g p (θ ∣ D) + \frac{1}{2} r^{⊤} r$

The algorithm simulates Hamiltonian dynamics using leapfrog integration for $L$ steps with step size $ϵ$ :

Half-step update of momentum: $r \leftarrow r - \frac{ϵ}{2} \nabla_{θ} (- lo g p (θ ∣ D))$
Full-step update of position: $θ \leftarrow θ + ϵ r$
Half-step update of momentum: $r \leftarrow r - \frac{ϵ}{2} \nabla_{θ} (- lo g p (θ ∣ D))$

The proposal is the endpoint after $L$ leapfrog steps, accepted via the Metropolis criterion using the Hamiltonian $H$ . HMC can traverse the posterior far more efficiently than random-walk MH, especially in high dimensions, because it follows the geometry of the distribution.

The No-U-Turn Sampler (NUTS)

HMC requires choosing the number of leapfrog steps $L$ and step size $ϵ$ . The No-U-Turn Sampler (NUTS) eliminates the need to set $L$ by automatically determining when the simulated trajectory begins to "turn around" (double back on itself). Combined with dual averaging for adapting $ϵ$ , NUTS is the default sampler in Stan and PyMC and is considered state-of-the-art for differentiable posteriors.

Why It Matters

MCMC is the backbone of practical Bayesian inference. It enables fitting complex hierarchical models, nonparametric models, and any posterior that lacks a closed-form solution. Modern probabilistic programming languages (Stan, PyMC, NumPyro) rely heavily on MCMC -- particularly HMC and NUTS -- to make Bayesian modeling accessible. MCMC provides asymptotically exact samples from the true posterior, giving it a fundamental correctness guarantee that approximate methods like variational inference lack.

Key Technical Details

MCMC requires only the unnormalized posterior -- the normalizing constant cancels in acceptance ratios.
Metropolis-Hastings is the general framework; Gibbs sampling and HMC are important special cases.
Convergence diagnostics ( $\hat{R}$ , ESS, trace plots) are essential -- never blindly trust MCMC output.
HMC requires gradients of the log-posterior, which modern autodiff frameworks provide automatically.
NUTS removes the need to hand-tune leapfrog trajectory length in HMC.
MCMC provides asymptotically exact samples but can be slow for very high-dimensional problems or highly multimodal posteriors.

Common Misconceptions

"More samples always means better results." If the chain has not converged or mixes poorly, a million correlated samples may be less informative than a hundred well-mixed ones. Effective sample size matters, not raw sample count.
"A single long chain is sufficient." Running multiple independent chains enables convergence diagnostics (like $\hat{R}$ ) and helps detect if a chain is stuck in one mode of a multimodal posterior.
"MCMC is always slow." For well-conditioned, moderate-dimensional posteriors, NUTS in Stan can produce thousands of effective samples in seconds. MCMC is slow when the posterior is very high-dimensional, strongly correlated, or multimodal.
"Thinning is always necessary." Thinning reduces storage but discards information. It is generally better to store all samples and compute effective sample sizes appropriately.

Connections to Other Concepts

bayesian-inference.md: MCMC is the primary computational tool for performing Bayesian inference when the posterior is intractable.
variational-inference.md: VI is the main alternative to MCMC. MCMC is asymptotically exact but slower; VI is faster but provides an approximation. The choice depends on the problem scale and accuracy requirements.
expectation-maximization.md: Monte Carlo EM replaces the intractable E-step expectation with MCMC samples from the posterior over latent variables.
graphical-models.md: Gibbs sampling exploits graphical model structure -- each variable is sampled conditional on its Markov blanket.
gaussian-processes.md: Fully Bayesian GP hyperparameter inference uses MCMC to sample from the posterior over kernel hyperparameters rather than optimizing a point estimate.