One-Line Summary: Normalizing layer inputs within each mini-batch -- stabilizing training, enabling higher learning rates, and providing implicit regularization.
Prerequisites: Mean and variance, perceptrons and multilayer networks, backpropagation, activation functions, gradient descent.
What Is Batch Normalization?
Imagine you are a teacher grading exams from different schools. One school's exams are scored 0-100, another's 0-10, and a third uses letter grades. Before you can fairly compare students, you must standardize the scores. Batch normalization does the same for neural network activations: it standardizes the inputs to each layer so that every layer sees data with consistent statistical properties, regardless of how the preceding layers have shifted and scaled their outputs.
Formally, Batch Normalization (BatchNorm), introduced by Ioffe and Szegedy in 2015, normalizes the pre-activation values within each mini-batch to have zero mean and unit variance, then applies a learnable affine transformation. This simple operation has dramatic effects on training stability and speed.
How It Works
The BatchNorm Formula
For a mini-batch of pre-activation values $x_1, \ldots, x_m$ at a particular neuron:

Step 1: Compute batch statistics

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

Step 2: Normalize

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\epsilon$ (typically $10^{-5}$) prevents division by zero.

Step 3: Scale and shift with learnable parameters

$$y_i = \gamma \hat{x}_i + \beta$$

The learnable parameters $\gamma$ (scale) and $\beta$ (shift) allow the network to undo the normalization if that is optimal. If $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, the transformation is the identity. This means BatchNorm can never reduce the network's representational capacity.
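A minimal NumPy sketch of these three steps (the function name, toy batch, and shapes are illustrative, not from the paper):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (batch_size, num_features)."""
    # Step 1: per-feature batch statistics
    mu = x.mean(axis=0)                 # shape (num_features,)
    var = x.var(axis=0)                 # shape (num_features,)
    # Step 2: normalize to zero mean, unit variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Step 3: learnable scale and shift
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 10 + 5    # toy mini-batch with shifted, scaled statistics
gamma, beta = np.ones(4), np.zeros(4)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0))                  # approximately 0 per feature
print(y.std(axis=0))                   # approximately 1 per feature
```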
Internal Covariate Shift
The original motivation for BatchNorm was the internal covariate shift hypothesis: as parameters in earlier layers change during training, the distribution of inputs to later layers shifts, forcing those layers to continually adapt to a moving target. By normalizing inputs, BatchNorm stabilizes these distributions.
However, subsequent research (Santurkar et al., 2018) suggested that BatchNorm's effectiveness stems more from smoothing the loss landscape (reducing the Lipschitz constant of the loss and its gradients) than from reducing covariate shift per se. The smoothed landscape enables larger learning rates and faster convergence.
Training vs. Inference
During training, $\mu_B$ and $\sigma_B^2$ are computed from the current mini-batch. The network also maintains exponential moving averages:

$$\mu_{\text{running}} \leftarrow (1 - \alpha)\,\mu_{\text{running}} + \alpha\,\mu_B, \qquad \sigma^2_{\text{running}} \leftarrow (1 - \alpha)\,\sigma^2_{\text{running}} + \alpha\,\sigma_B^2$$

where $\alpha$ is the momentum (typically 0.1).
During inference, the running statistics are used instead of batch statistics, ensuring deterministic outputs that do not depend on other examples in the batch.
This train/inference discrepancy is a frequent source of bugs. Forgetting to switch a model to evaluation mode before inference will cause it to use batch statistics, producing erratic outputs, especially with small batch sizes.
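The following sketch shows how a minimal implementation might maintain and consume running statistics; the class name, the `training` flag, and the update convention (matching PyTorch's, where the momentum weights the new batch statistic) are assumptions for illustration:

```python
import numpy as np

class MinimalBatchNorm:
    """Illustrative BatchNorm with running statistics (not a library API)."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving average: new = (1 - momentum) * old + momentum * batch
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # inference: use running statistics -> deterministic, batch-independent output
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta
```

The bug described above corresponds to leaving `training` set to `True` at inference: with a single-example batch, the batch variance is zero and every output collapses to $\beta$.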
Where to Place BatchNorm
There are two conventions:
- Pre-activation: $h = g(\mathrm{BN}(Wx))$, where $g$ is the activation function; normalize before activating. This is the original formulation.
- Post-activation: $h = \mathrm{BN}(g(Wx))$; normalize after activating.
Both work in practice, with pre-activation being more common. The key insight is that BatchNorm keeps pre-activations in the non-saturating regime of the activation function, improving gradient flow.
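In PyTorch, the two conventions differ only in module ordering; a sketch using standard `torch.nn` layers (layer sizes are arbitrary):

```python
import torch.nn as nn

# Pre-activation (original formulation): Linear -> BatchNorm -> ReLU
pre_act = nn.Sequential(
    nn.Linear(128, 64, bias=False),  # bias is redundant before BatchNorm
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

# Post-activation: Linear -> ReLU -> BatchNorm
post_act = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
)
```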
Normalization Variants
Layer Normalization (LayerNorm): Normalizes across all features within a single example, rather than across the batch. Statistics $\mu$ and $\sigma^2$ are computed over the feature dimension of each example independently. LayerNorm has no train/inference discrepancy and works with any batch size. It is the standard for transformers and recurrent networks.
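The difference is simply the axis over which statistics are computed. A NumPy illustration for a `(batch, features)` tensor (shapes chosen arbitrarily):

```python
import numpy as np

x = np.random.randn(8, 16)            # (batch_size, num_features)

# BatchNorm: statistics per feature, computed across the batch axis
bn_mu, bn_var = x.mean(axis=0), x.var(axis=0)                                # shapes (16,)

# LayerNorm: statistics per example, computed across the feature axis
ln_mu, ln_var = x.mean(axis=1, keepdims=True), x.var(axis=1, keepdims=True)  # shapes (8, 1)

x_bn = (x - bn_mu) / np.sqrt(bn_var + 1e-5)   # couples examples in the batch
x_ln = (x - ln_mu) / np.sqrt(ln_var + 1e-5)   # each example normalized independently
```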
Group Normalization (GroupNorm): Divides channels into groups and normalizes within each group. A compromise between LayerNorm (one group) and Instance Normalization (each channel is its own group). Useful when batch sizes are too small for reliable batch statistics (e.g., object detection, video).
RMS Normalization (RMSNorm): A simplified variant that skips mean centering:

$$\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot \gamma, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$$
RMSNorm is computationally cheaper and is used in LLaMA, Gemma, and other modern large language models. Empirically, it performs comparably to LayerNorm.
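A minimal sketch matching the formula above (placing a small $\epsilon$ inside the square root follows common practice; its value is an implementation detail assumed here):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    """RMSNorm over the last axis: no mean subtraction, only scaling by the RMS."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.random.randn(2, 8)
print(rmsnorm(x, np.ones(8)).shape)   # (2, 8)
```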
Why It Matters
BatchNorm was one of the most impactful techniques of the 2015-2020 deep learning era. It enabled training deeper networks, allowed learning rates up to 10x larger, and reduced sensitivity to weight initialization. Its variants (LayerNorm, RMSNorm) are essential components of every modern transformer architecture.
Key Technical Details
- BatchNorm adds learnable parameters per layer ($\gamma$ and $\beta$, one pair per normalized feature) and stores running statistics.
- Effective batch sizes for BatchNorm should be at least 16-32 for reliable statistics; smaller batches favor GroupNorm or LayerNorm.
- BatchNorm provides implicit regularization through the noise in mini-batch statistics. Larger batches reduce this noise and may require additional explicit regularization.
- In convolutional networks, BatchNorm is applied per-channel: $\gamma, \beta \in \mathbb{R}^C$, where $C$ is the number of channels, with statistics computed across the batch and spatial dimensions $(N, H, W)$.
- When using BatchNorm immediately after a linear or convolutional layer, the bias term in that layer is redundant (absorbed by $\beta$) and should be omitted; see the sketch after this list.
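The last two points expressed in PyTorch (a sketch; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),  # bias absorbed by BatchNorm's beta
    nn.BatchNorm2d(64),  # gamma, beta, and running stats all have shape (64,): one per channel
    nn.ReLU(),
)

x = torch.randn(16, 3, 32, 32)   # (N, C, H, W)
y = block(x)                     # BatchNorm2d averages over N, H, W for each channel
```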
Common Misconceptions
- "BatchNorm works because it reduces internal covariate shift." This was the original hypothesis, but Santurkar et al. (2018) showed that BatchNorm's primary benefit is smoothing the loss landscape. Networks with artificially induced covariate shift still train well with BatchNorm.
- "BatchNorm and LayerNorm are interchangeable." BatchNorm normalizes across the batch (coupling examples together); LayerNorm normalizes across features (each example is independent). They have very different properties for sequence models, attention mechanisms, and small-batch regimes.
- "BatchNorm makes initialization irrelevant." BatchNorm reduces sensitivity to initialization, but extremely poor initialization (e.g., all zeros) can still cause problems. Proper initialization combined with BatchNorm gives the best results.
Connections to Other Concepts
- activation-functions.md: BatchNorm keeps pre-activations in the sensitive (non-saturating) range of activations like sigmoid and tanh.
- weight-initialization.md: BatchNorm reduces but does not eliminate the importance of initialization; both work together for stable training.
- backpropagation.md: The gradient flows through the normalization statistics, coupling the gradients of all examples in the batch.
- dropout-and-regularization.md: BatchNorm provides implicit regularization via mini-batch noise; combining it with dropout requires care (they can interact poorly).
- optimizers.md: BatchNorm's loss landscape smoothing enables larger learning rates, which interacts with optimizer step size tuning.
Further Reading
- Ioffe and Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015) -- The original BatchNorm paper.
- Santurkar et al., "How Does Batch Normalization Help Optimization?" (2018) -- Revisits the covariate shift hypothesis with the loss landscape smoothing explanation.
- Ba, Kiros, and Hinton, "Layer Normalization" (2016) -- LayerNorm, now standard in transformers.
- Zhang and Sennrich, "Root Mean Square Layer Normalization" (2019) -- RMSNorm, used in modern LLMs.