One-Line Summary: Plotting performance vs. training set size or training iterations -- diagnosing whether you need more data, more capacity, or more regularization.

Prerequisites: Bias-variance trade-off, overfitting, underfitting, cross-validation, regression metrics, classification metrics.

What Are Learning Curves?

Imagine you are studying for an exam. At first, both your practice scores and your real test scores improve rapidly. Eventually, your practice scores keep climbing (you memorize the practice material), but your real test scores plateau. The gap between the two tells you whether you need more practice problems, better study methods, or more fundamental understanding. Learning curves are the same diagnostic applied to machine learning models -- they plot performance against a resource axis (training set size, training iterations, or model complexity) and reveal whether a model suffers from bias, variance, or is already near optimal.

Formally, a learning curve is a function that maps a resource quantity (typically the number of training examples, n) to a performance measure (training error, validation error, or both).

How It Works

Training Curves vs. Validation Curves

Two related but distinct diagnostic plots share the name "learning curve", and a third, complementary plot is often discussed alongside them:

Sample learning curve (the classic form): Plots training error and validation error as a function of training set size n. Generated by repeatedly training the model on subsets of increasing size and evaluating on a fixed validation set.

Iteration learning curve (training curve): Plots training loss and validation loss as a function of training iterations (epochs, gradient steps). This is the standard monitoring tool during deep learning training.

Validation curve (complexity curve): Plots training and validation performance as a function of a hyperparameter controlling model complexity (e.g., tree depth, regularization strength, number of features). This is not a "learning" curve per se, but a complementary diagnostic.

Diagnosing High Bias (Underfitting)

A model with high bias is too simple to capture the underlying pattern.

Signature on sample learning curves:

  • Training error is high (the model cannot even fit the training data well).
  • Validation error is high and close to the training error.
  • Both curves plateau early -- adding more data barely helps.
  • The gap between curves is small.

What to do: Increase model complexity (more features, deeper trees, wider networks), reduce regularization, or use a more expressive model family.

Diagnosing High Variance (Overfitting)

A model with high variance memorizes training data but fails to generalize.

Signature on sample learning curves:

  • Training error is low (the model fits the training data very well).
  • Validation error is much higher than training error.
  • The gap between curves is large.
  • Validation error is still decreasing as training set size increases, suggesting more data would help.

What to do: Get more training data, increase regularization, reduce model complexity, use dropout or early stopping, or apply data augmentation.
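The two signatures above (high bias in the previous subsection, high variance here) can be turned into a rough rule of thumb. Below is a minimal sketch in plain Python; the error and gap thresholds are illustrative assumptions that would need to be chosen per problem.

```python
def diagnose_learning_curve(train_err, val_err, acceptable_err=0.10, gap_tol=0.05):
    """Rough diagnosis from the errors at the largest training set size.

    train_err, val_err -- training and validation error at the largest n.
    acceptable_err     -- what counts as "high" error for this task (assumed).
    gap_tol            -- what counts as a "large" train/validation gap (assumed).
    """
    gap = val_err - train_err
    if train_err > acceptable_err and gap <= gap_tol:
        return "high bias: increase capacity, add features, reduce regularization"
    if train_err <= acceptable_err and gap > gap_tol:
        return "high variance: more data, more regularization, or a simpler model"
    if train_err > acceptable_err and gap > gap_tol:
        return "both: add capacity first, then regularize or collect more data"
    return "near-optimal fit for this data"

# Low training error but a large gap points to high variance.
print(diagnose_learning_curve(train_err=0.02, val_err=0.15))
```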

The Effect of Adding More Data

Adding data helps when the model has high variance (the gap is large), because more training examples constrain the model, pulling validation error down toward training error. However, if the model has high bias (both errors are high and plateaued), more data is unlikely to help -- the model simply cannot represent the true function.

This is a crucial practical insight: collecting more data is expensive. Learning curves tell you whether it is worth the investment.

The Effect of Model Complexity

As complexity increases:

  • Training error generally decreases (more capacity to fit the data).
  • Validation error initially decreases (better representation), then increases (overfitting).

The optimal complexity is at the minimum of the validation curve. This U-shaped validation curve is a direct visualization of the bias-variance trade-off.
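As an illustration of the U shape, the sketch below (assuming scikit-learn and NumPy are installed) fits polynomial regressions of increasing degree to noisy synthetic data; training MSE keeps falling with degree, while validation MSE falls and then rises.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy sine wave
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 3, 5, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # Training MSE keeps dropping with degree; validation MSE traces a U shape.
    print(f"degree={degree:2d}  train MSE={tr_mse:.3f}  val MSE={val_mse:.3f}")
```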

Validation Curves (Performance vs. Hyperparameter)

A validation curve plots a performance metric against a single hyperparameter value. For example, plotting accuracy vs. the regularization parameter C in an SVM:

  • Small C: High regularization, high bias, both training and validation accuracy are low.
  • Optimal C: Training accuracy is good, validation accuracy is at its peak.
  • Large C: Low regularization, training accuracy is near-perfect, validation accuracy drops.

This plot directly guides hyperparameter tuning and is a focused version of the general learning curve concept.
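A minimal sketch of such a plot using scikit-learn's validation_curve helper; the digits dataset and the RBF-kernel settings are placeholder choices, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

X, y = load_digits(return_X_y=True)
C_range = np.logspace(-3, 3, 7)          # small C = strong regularization

train_scores, val_scores = validation_curve(
    SVC(kernel="rbf", gamma=0.001), X, y,
    param_name="C", param_range=C_range, cv=5, scoring="accuracy",
)

for C, tr, va in zip(C_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Validation accuracy peaks at an intermediate C; training accuracy keeps rising.
    print(f"C={C:>8.3f}  train acc={tr:.3f}  val acc={va:.3f}")
```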

Generating Learning Curves in Practice

For a sample learning curve with k-fold cross-validation:

  1. Define a sequence of training set sizes: n_1 < n_2 < ... < n_m.
  2. For each n_i, repeat k-fold CV but use only n_i examples from each training fold.
  3. Record mean and standard deviation of both training and validation metrics.
  4. Plot both curves with error bands.

The standard deviation bands are important -- they show whether the curves are noisy (suggesting you need more CV repetitions) or stable.
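scikit-learn's learning_curve helper implements this procedure. A minimal sketch follows; the dataset, model, and size grid are placeholder choices, and matplotlib is assumed for the error bands.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Steps 1-3: train at increasing sizes with 5-fold CV, keep mean and std.
sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy",
)
tr_mean, tr_std = train_scores.mean(axis=1), train_scores.std(axis=1)
va_mean, va_std = val_scores.mean(axis=1), val_scores.std(axis=1)

# Step 4: plot both curves with error bands.
plt.plot(sizes, tr_mean, "o-", label="training")
plt.fill_between(sizes, tr_mean - tr_std, tr_mean + tr_std, alpha=0.2)
plt.plot(sizes, va_mean, "o-", label="validation")
plt.fill_between(sizes, va_mean - va_std, va_mean + va_std, alpha=0.2)
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```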

Practical Interpretation Guide

| Pattern | Diagnosis | Action |
| --- | --- | --- |
| Both curves high, small gap | High bias | Increase complexity, add features |
| Training low, validation high, large gap | High variance | More data, regularize, simplify |
| Both curves low, small gap | Good fit | Model is near-optimal for this data |
| Validation still decreasing at max n | More data would help | Collect more data if possible |
| Validation plateaued but gap persists | Irreducible gap | Consider ensemble methods or better features |

Why It Matters

Learning curves provide the diagnostic information needed to make informed decisions about where to invest effort. Without them, practitioners often waste time collecting more data when the real problem is model capacity, or adding complexity when the model simply needs more examples. They transform model improvement from guesswork into systematic analysis.

Key Technical Details

  • Computational cost: Generating sample learning curves requires training the model many times at different dataset sizes, each with cross-validation. This can be expensive for large models.
  • Monotonicity: Training error is expected to increase with training set size n (more data is harder to memorize), while validation error is expected to decrease. Violations suggest problems with the evaluation setup.
  • Log-scale axis: For training set size, a log-scale x-axis often reveals the rate of improvement more clearly. Power law relationships appear as straight lines on a log-log plot.
  • Early stopping: Iteration learning curves motivate early stopping -- halt training when validation loss stops decreasing to prevent overfitting (see the sketch after this list).
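A framework-agnostic sketch of patience-based early stopping; train_one_epoch and validate are hypothetical callables standing in for whatever training and evaluation routines a project already has.

```python
def train_with_early_stopping(model, train_one_epoch, validate,
                              max_epochs=200, patience=10):
    """Stop when validation loss has not improved for `patience` epochs.

    train_one_epoch(model) and validate(model) are hypothetical callables
    returning the epoch's training loss and validation loss, respectively.
    """
    history = []                      # (epoch, train_loss, val_loss): the iteration learning curve
    best_val, epochs_without_improvement = float("inf"), 0

    for epoch in range(max_epochs):
        train_loss = train_one_epoch(model)
        val_loss = validate(model)
        history.append((epoch, train_loss, val_loss))

        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
            # In practice, also checkpoint the model weights here.
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                 # validation loss has plateaued or is rising

    return history                    # plot train vs. validation loss to inspect the curve
```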

Common Misconceptions

  • "If validation error is high, I always need more data." High validation error could indicate high bias (underfitting) rather than high variance. Only high-variance models benefit from more data. Check whether training error is also high to disambiguate.
  • "Learning curves are only for deep learning." Sample learning curves are useful for any model family, including decision trees, SVMs, and linear models. They are among the most informative diagnostics available.
  • "The gap between training and validation always indicates a problem." Some gap is expected due to irreducible noise and the inherent difference between fitting and predicting. Only a large or growing gap is concerning.
  • "Training until training loss reaches zero is optimal." Zero training loss usually signals overfitting, especially for models with high capacity. The iteration learning curve (with validation loss) tells you when to stop.

Connections to Other Concepts

  • cross-validation.md: Used to generate reliable learning curve estimates at each training set size.
  • hyperparameter-tuning.md: Validation curves (a type of learning curve) guide hyperparameter selection.
  • regression-metrics.md: The y-axis of learning curves plots these metrics.
  • model-comparison.md: Learning curves provide context beyond a single number -- two models may have similar average performance but very different learning dynamics.
  • calibration.md: Monitoring calibration metrics across training iterations can reveal when models become overconfident.

Further Reading

  • Perlich, Provost & Simonoff, "Tree Induction vs. Logistic Regression: A Learning-Curve Analysis" (2003) -- Demonstrates how learning curves reveal when complex models surpass simpler ones.
  • Hestness et al., "Deep Learning Scaling is Predictable, Empirically" (2017) -- Shows power-law learning curves for deep networks and their implications for scaling.
  • Hastie, Tibshirani & Friedman, "The Elements of Statistical Learning," Section 7.2 (2009) -- Formal treatment of bias, variance, and model complexity.