One-Line Summary: The fundamental tension between underfitting and overfitting -- every model navigates this tradeoff whether you manage it or not.
Prerequisites: What Is Machine Learning, basic probability, expected value.
What Is the Bias-Variance Tradeoff?
Imagine throwing darts at a target. Bias is how far the center of your throws is from the bullseye -- a systematic offset. Variance is how spread out your throws are from each other. A perfect thrower has low bias and low variance. But in machine learning, reducing one often increases the other: a very flexible model can center on the bullseye (low bias) but scatters widely across different training sets (high variance), while a rigid model throws consistently (low variance) but may always miss to the left (high bias).
Formally, the bias-variance tradeoff decomposes the expected prediction error of a model into three components (squared bias, variance, and irreducible noise), revealing the fundamental tension that governs all learning algorithms.
How It Works
The Decomposition
Consider a regression setting where we want to predict $y = f(x) + \varepsilon$, with $\varepsilon$ being noise with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$. Let $\hat{f}_D(x)$ be the model trained on a particular training set $D$.
The expected prediction error at a point $x$, averaged over all possible training sets, decomposes as:

$$\mathbb{E}_{D,\varepsilon}\Big[\big(y - \hat{f}_D(x)\big)^2\Big] = \underbrace{\Big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\Big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\Big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\Big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$
Bias squared: How far the average prediction (across all possible training sets) is from the true function. This measures systematic error due to model assumptions.
Variance: How much predictions fluctuate across different training sets. This measures sensitivity to the particular data sampled.
Irreducible error: The noise inherent in the data. No model can eliminate this.
Deriving the Decomposition
Starting from the expected squared error and letting $\bar{f}(x) = \mathbb{E}_D[\hat{f}_D(x)]$ denote the average prediction over training sets:

$$\mathbb{E}\Big[\big(y - \hat{f}_D(x)\big)^2\Big] = \mathbb{E}\Big[\big(f(x) + \varepsilon - \hat{f}_D(x)\big)^2\Big] = \mathbb{E}\Big[\big(f(x) - \hat{f}_D(x)\big)^2\Big] + 2\,\mathbb{E}\Big[\varepsilon\,\big(f(x) - \hat{f}_D(x)\big)\Big] + \mathbb{E}\big[\varepsilon^2\big]$$

Since $\varepsilon$ is independent of $\hat{f}_D(x)$ and has zero mean, the cross term vanishes:

$$\mathbb{E}\Big[\big(y - \hat{f}_D(x)\big)^2\Big] = \mathbb{E}\Big[\big(f(x) - \hat{f}_D(x)\big)^2\Big] + \sigma^2$$

Now decompose the first term by adding and subtracting $\bar{f}(x)$; the cross term again vanishes because $\mathbb{E}_D\big[\hat{f}_D(x) - \bar{f}(x)\big] = 0$:

$$\mathbb{E}\Big[\big(f(x) - \hat{f}_D(x)\big)^2\Big] = \big(f(x) - \bar{f}(x)\big)^2 + \mathbb{E}_D\Big[\big(\hat{f}_D(x) - \bar{f}(x)\big)^2\Big]$$
This yields $\text{Bias}^2 + \text{Variance}$; adding back the $\sigma^2$ term completes the decomposition.
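The decomposition can also be verified numerically. The sketch below is a minimal Monte Carlo check, not a recipe from the text: the sine-wave true function, noise level, and straight-line model are all illustrative assumptions. It repeatedly draws training sets, fits a line by least squares, and confirms that bias² + variance + σ² matches the directly estimated expected squared error at a single test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # illustrative "true" function (assumed, not from the text)
    return np.sin(x)

sigma = 0.3                     # noise std, so sigma**2 is the irreducible error
n, n_trials = 30, 5000          # training-set size and number of simulated training sets
x0 = 1.5                        # fixed test point

preds, sq_errors = [], []
for _ in range(n_trials):
    # Draw one training set D and fit a straight line by least squares.
    x = rng.uniform(-np.pi, np.pi, n)
    y = f(x) + rng.normal(0, sigma, n)
    slope, intercept = np.polyfit(x, y, 1)
    fhat_x0 = slope * x0 + intercept           # this model's prediction at x0
    y0 = f(x0) + rng.normal(0, sigma)          # a fresh noisy observation at x0
    preds.append(fhat_x0)
    sq_errors.append((y0 - fhat_x0) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2          # (E_D[f_hat(x0)] - f(x0))^2
variance = preds.var()                         # E_D[(f_hat(x0) - E_D[f_hat(x0)])^2]
print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.4f}")
print(f"Monte Carlo expected error  = {np.mean(sq_errors):.4f}")   # should agree
```

The two printed numbers agree up to simulation noise, which is exactly what the derivation above promises.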
Model Complexity and the Tradeoff
As model complexity increases:
- Bias decreases: More flexible models can approximate the true function more closely.
- Variance increases: More flexible models are more sensitive to the specific training data.
The total error follows a U-shaped curve. The optimal complexity balances bias and variance to minimize total expected error.
| Complexity | Bias | Variance | Total Error | Regime |
|---|---|---|---|---|
| Very low | High | Low | High | Underfitting |
| Optimal | Moderate | Moderate | Minimum | Sweet spot |
| Very high | Low | High | High | Overfitting |
Concrete Example: Polynomial Regression
Fitting a polynomial of degree $d$ to noisy data generated from a true cubic function (a short simulation sketch follows this list):
- $d = 1$ (linear): High bias (cannot capture curvature), low variance. The line is similar regardless of which training points you sample.
- $d = 3$ (cubic): Low bias (matches the true form), moderate variance.
- Large $d$ (e.g., $d = 10$): Near-zero training error, but wild oscillations between data points. Different training sets yield drastically different fits.
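One quick way to see this variability is to refit each degree on many resampled training sets and watch how much the prediction at a single point moves around. The sketch below is a minimal illustration with assumed choices (the particular cubic, noise level, sample size, and test point are not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: x**3 - x          # an illustrative cubic (assumed)
sigma, n, trials, x0 = 0.3, 40, 2000, 0.8

for degree in (1, 3, 10):
    preds = []
    for _ in range(trials):
        x = rng.uniform(-1.5, 1.5, n)
        y = true_f(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
        preds.append(np.polyval(coeffs, x0))     # prediction at the fixed point x0
    preds = np.array(preds)
    # Mean offset from the truth reflects bias; spread across training sets reflects variance.
    print(f"degree {degree:2d}: mean prediction = {preds.mean():+.3f} "
          f"(truth {true_f(x0):+.3f}), std across training sets = {preds.std():.3f}")
```

The degree-1 fits land in nearly the same place every time but systematically miss the cubic's value; the degree-10 fits are centered close to the truth but scatter far more from one training set to the next.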
The Bullseye Diagram
Visualize four scenarios on a dart board:
- Low bias, low variance: Darts clustered tightly around the center. Ideal but hard to achieve.
- Low bias, high variance: Darts centered on the bullseye on average but scattered widely.
- High bias, low variance: Darts tightly clustered but consistently off-center.
- High bias, high variance: Darts scattered and off-center. The worst case.
Quantitative Example
Suppose the true function $f(x)$ is a cubic and we observe $y = f(x) + \varepsilon$ with $\sigma^2 = 0.1$. We fit models of varying complexity across 100 different training sets of size $n$:
| Model | Avg. Bias² | Avg. Variance | Irreducible | Total Error |
|---|---|---|---|---|
| Constant (degree 0) | 0.42 | 0.003 | 0.10 | 0.52 |
| Linear (degree 1) | 0.12 | 0.01 | 0.10 | 0.23 |
| Cubic (degree 3) | 0.002 | 0.03 | 0.10 | 0.13 |
| Degree 10 | 0.001 | 0.15 | 0.10 | 0.25 |
| Degree 20 | 0.0005 | 0.58 | 0.10 | 0.68 |
The cubic model achieves the lowest total error. Higher-degree polynomials drive bias toward zero but their variance explodes, increasing total error.
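Numbers in this style can be reproduced with a short Monte Carlo experiment. The sketch below follows the same recipe (many training sets, bias² and variance averaged over a grid of test points), but the specific cubic, data range, and training-set size are assumptions, so the figures it prints will not match the table exactly; it also stops at degree 10 because unregularized degree-20 fits are numerically delicate with plain `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: x**3 - x              # assumed cubic; the original f(x) is not specified
sigma2, n, n_sets = 0.10, 50, 100        # noise variance, training-set size, number of training sets
x_test = np.linspace(-1.5, 1.5, 200)     # grid over which bias^2 and variance are averaged

print(f"{'degree':>6} {'bias^2':>8} {'variance':>9} {'total':>7}")
for degree in (0, 1, 3, 10):
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(-1.5, 1.5, n)
        y = true_f(x) + rng.normal(0, np.sqrt(sigma2), n)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias_sq = ((preds.mean(axis=0) - true_f(x_test)) ** 2).mean()   # avg. squared bias
    variance = preds.var(axis=0).mean()                             # avg. variance
    print(f"{degree:>6} {bias_sq:>8.4f} {variance:>9.4f} {bias_sq + variance + sigma2:>7.3f}")
```

Whatever the exact numbers, the pattern is the same U-shape: bias² collapses as the degree rises while variance grows, and the total is smallest near the true model's complexity.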
The Effect of Training Set Size
For a fixed model complexity, increasing the training set size $n$ reduces variance without changing bias:

$$\text{Variance} \approx \frac{\sigma^2 \, p}{n}$$

where $p$ is the effective number of parameters. This is why more data helps -- but only up to the point where bias dominates. If the model is fundamentally wrong (e.g., fitting a line to a sine wave), no amount of data will fix the bias.
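A small simulation makes the point concrete. In the sketch below the true function is assumed to be a sine wave and the model is a deliberately misspecified straight line (neither choice comes from the text): as $n$ grows, the variance term shrinks roughly like $1/n$ while the squared bias stays put.

```python
import numpy as np

rng = np.random.default_rng(3)
true_f = np.sin                  # assumed true function; a straight line can never match it
sigma, n_sets = 0.3, 500
x_test = np.linspace(-np.pi, np.pi, 100)

for n in (20, 80, 320, 1280):
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(-np.pi, np.pi, n)
        y = true_f(x) + rng.normal(0, sigma, n)
        preds[i] = np.polyval(np.polyfit(x, y, 1), x_test)   # linear fit: the wrong model class
    bias_sq = ((preds.mean(axis=0) - true_f(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"n = {n:>4}: bias^2 = {bias_sq:.4f} (flat), variance = {variance:.5f} (~1/n)")
```

Quadrupling $n$ cuts the printed variance by roughly a factor of four, while the bias term barely moves.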
Why It Matters
The bias-variance tradeoff is the theoretical lens through which all model selection decisions should be viewed. When you choose a model architecture, set a regularization strength, decide how many features to include, or determine training duration, you are implicitly navigating this tradeoff. Understanding it prevents two costly mistakes: spending weeks tuning a model that is fundamentally too simple (bias-dominated) or adding complexity when the real problem is insufficient data (variance-dominated).
Key Technical Details
- The decomposition as presented applies exactly to squared error loss. For other losses (e.g., 0-1 loss in classification), analogous but more complex decompositions exist.
- Ensemble methods explicitly exploit the tradeoff: bagging (random forests) reduces variance by averaging many high-variance models; boosting reduces bias by sequentially correcting errors.
- Double descent: In highly over-parameterized models (e.g., deep neural networks), the test error can decrease again beyond the classical U-curve interpolation threshold. This challenges the simple U-shaped view but does not invalidate the underlying decomposition.
- Regularization directly controls the tradeoff by constraining model complexity, effectively trading increased bias for decreased variance (a ridge regression sketch follows this list).
- For a fixed model, more training data reduces variance without affecting bias.
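To make the regularization bullet concrete, here is a minimal sketch using scikit-learn's `Ridge` on polynomial features; the true function, feature degree, and penalty grid are illustrative assumptions rather than a prescribed setup. As the penalty `alpha` grows, variance falls and squared bias rises.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
true_f = lambda x: np.sin(2 * x)        # illustrative true function (assumed)
sigma, n, n_sets = 0.3, 40, 200
x_test = np.linspace(-1.5, 1.5, 100).reshape(-1, 1)

for alpha in (1e-6, 1e-2, 1.0, 100.0):  # regularization strength, weakest to strongest
    preds = np.empty((n_sets, x_test.shape[0]))
    for i in range(n_sets):
        x = rng.uniform(-1.5, 1.5, (n, 1))
        y = true_f(x).ravel() + rng.normal(0, sigma, n)
        model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
        preds[i] = model.fit(x, y).predict(x_test)
    bias_sq = ((preds.mean(axis=0) - true_f(x_test).ravel()) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"alpha = {alpha:>7}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

A heavily penalized model behaves like a rigid one (higher bias, lower variance); a barely penalized one behaves like the wildly flexible degree-10 fit (lower bias, higher variance). The best alpha sits in between.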
Common Misconceptions
- "You must always trade off bias for variance." Techniques like ensemble methods, better features, and more data can reduce both simultaneously. The tradeoff is about what happens when you vary complexity with everything else held fixed.
- "Deep neural networks violate the bias-variance tradeoff." The double descent phenomenon is real, but the decomposition still holds. What changes is the behavior of each term in the over-parameterized regime, where implicit regularization plays a role.
- "High training accuracy means low bias." Training accuracy reflects bias on the training set, but model bias is defined in terms of the expected prediction over all possible training sets relative to the true function.
- "The tradeoff only matters for regression." The principle applies to classification and structured prediction as well, though the mathematical decomposition is cleanest for squared error.
Connections to Other Concepts
- overfitting-and-underfitting.md: The practical manifestations of high variance and high bias, respectively.
- regularization.md: The primary tool for managing the tradeoff by constraining model flexibility.
- empirical-risk-minimization.md: The bias-variance tradeoff explains why minimizing training loss alone (ERM without regularization) can fail.
- loss-functions.md: The decomposition depends on the loss; MSE gives the cleanest decomposition.
- curse-of-dimensionality.md: High-dimensional spaces amplify variance because models must estimate more parameters from the same amount of data.
Further Reading
- Hastie, T., Tibshirani, R., Friedman, J., The Elements of Statistical Learning (2009), Chapter 7 -- The canonical treatment of bias-variance tradeoff.
- Geman, S., Bienenstock, E., Doursat, R., "Neural Networks and the Bias/Variance Dilemma" (1992) -- The foundational paper connecting the tradeoff to neural networks.
- Belkin, M. et al., "Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off" (2019) -- Introduces the double descent curve.
- Domingos, P., "A Unified Bias-Variance Decomposition" (2000) -- Extends the decomposition beyond squared error.