One-Line Summary: Random variables, distributions, Bayes' theorem, and conditional probability -- the language of uncertainty in ML.
Prerequisites: Basic set theory, Vectors and Matrices.
What Is Probability?
Uncertainty is inescapable in machine learning. Training data is a noisy sample from a larger population. Predictions are inherently uncertain. Models must quantify how confident they are. Probability theory provides the rigorous mathematical framework for reasoning about uncertainty.
Think of probability as a way to assign a number between 0 and 1 to events, where 0 means impossible and 1 means certain. If you flip a fair coin, the probability of heads is 0.5 -- not because the outcome is inherently random (physics could predict it) but because your information about the outcome is incomplete. This view of probability as quantified incomplete information connects naturally to the Bayesian perspective used throughout modern ML.
How It Works
Sample Spaces, Events, and Axioms
A sample space $\Omega$ is the set of all possible outcomes. An event is a subset of outcomes. A probability measure $P$ assigns values to events satisfying Kolmogorov's axioms:
- $P(A) \geq 0$ for all events $A$, and $P(\Omega) = 1$
- For mutually exclusive events $A_1, A_2, \ldots$: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$
From these axioms, all of probability theory follows.
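As a quick illustration (not from the original text), here is a minimal sketch of a finite probability space in Python -- a fair six-sided die with the measure stored as a dict -- and the axioms checked numerically:

```python
# A minimal sketch (illustrative): a finite probability space for a fair die.
P = {outcome: 1 / 6 for outcome in range(1, 7)}   # sample space {1, ..., 6}

def prob(event):
    """Probability of an event, i.e. a set of outcomes."""
    return sum(P[o] for o in event)

assert all(p >= 0 for p in P.values())                    # non-negativity
assert abs(prob(set(P)) - 1.0) < 1e-12                    # P(sample space) = 1
A, B = {1, 2}, {5, 6}                                     # mutually exclusive events
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12     # additivity
```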
Conditional Probability and Independence
The conditional probability of $A$ given that $B$ has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$

Events $A$ and $B$ are independent if $P(A \cap B) = P(A)\,P(B)$, or equivalently $P(A \mid B) = P(A)$. In ML, feature independence is a strong assumption (used in Naive Bayes) that rarely holds exactly but often works surprisingly well.
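To make the definitions concrete, here is a small Monte Carlo sketch (illustrative, not from the original text) that estimates $P(A \mid B)$ from simulated binary events and checks the independence criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two dependent binary events: B agrees with A 80% of the time.
A = rng.random(n) < 0.3
B = np.where(rng.random(n) < 0.8, A, ~A)

p_A, p_B = A.mean(), B.mean()
p_A_and_B = (A & B).mean()
p_A_given_B = p_A_and_B / p_B            # P(A|B) = P(A and B) / P(B)

print(f"P(A)      ~ {p_A:.3f}")
print(f"P(A|B)    ~ {p_A_given_B:.3f}   (differs from P(A): A and B are dependent)")
print(f"P(A)P(B)  ~ {p_A * p_B:.3f}  vs  P(A and B) ~ {p_A_and_B:.3f}")
```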
Bayes' Theorem
Bayes' theorem reverses conditioning:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In ML language: posterior $\propto$ likelihood $\times$ prior. Writing $\theta$ for parameters and $D$ for data, this is the foundation of Bayesian inference:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

- $P(\theta)$: prior belief about parameters before seeing data
- $P(D \mid \theta)$: likelihood of observing data given parameters
- $P(\theta \mid D)$: posterior belief after observing data
- $P(D)$: marginal likelihood (evidence)
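A standard worked example (the numbers are illustrative assumptions, not from the original text): a rare condition and an imperfect test, showing how the posterior combines prior and likelihood.

```python
# Hypothetical diagnostic-test example; 1% prior, 95% sensitivity,
# 5% false-positive rate are assumed values for illustration.
prior = 0.01                 # P(disease)
sensitivity = 0.95           # P(positive | disease)
false_positive = 0.05        # P(positive | no disease)

# Evidence via the law of total probability: P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prior / evidence
print(f"P(disease | positive) ~ {posterior:.3f}")   # ~ 0.161: still fairly unlikely
```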
Random Variables
A random variable is a function from the sample space to real numbers: $X : \Omega \to \mathbb{R}$.
Discrete random variables take countably many values. They are characterized by a probability mass function (PMF): $p_X(x) = P(X = x)$.
Continuous random variables take values in intervals. They are characterized by a probability density function (PDF) $f_X$ where:

$$P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$$

Note that $f_X(x)$ itself is not a probability -- it can exceed 1. Only integrals of the PDF over intervals yield probabilities.
The cumulative distribution function (CDF) $F_X(x) = P(X \leq x)$ works for both discrete and continuous variables and is always non-decreasing from 0 to 1.
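A short sketch using scipy.stats (an assumed dependency, chosen here for illustration) that contrasts densities with probabilities and shows the CDF at work:

```python
from scipy import stats

# A density is not a probability: a narrow Gaussian has pdf values above 1.
narrow = stats.norm(loc=0.0, scale=0.1)
print(narrow.pdf(0.0))                      # ~ 3.99 -- legal for a density

# Probabilities come from integrating the density, i.e. from the CDF.
print(narrow.cdf(0.1) - narrow.cdf(-0.1))   # P(-0.1 <= X <= 0.1) ~ 0.683

# Discrete case: PMF values are probabilities and sum to 1.
coin = stats.bernoulli(0.5)
print(coin.pmf(1) + coin.pmf(0))            # 1.0
```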
Common Distributions
Bernoulli($p$): Binary outcome with $P(X = 1) = p$ and $P(X = 0) = 1 - p$. Models coin flips, click/no-click, spam/not-spam.
Binomial($n, p$): Number of successes in $n$ independent Bernoulli trials. $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$.
Poisson($\lambda$): Count of events in a fixed interval. $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$. Models rare events (website visits, hardware failures).
Uniform($a, b$): $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$. Maximum ignorance about where a value falls in an interval.
Gaussian (Normal) $\mathcal{N}(\mu, \sigma^2)$:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
The most important distribution in ML. The Central Limit Theorem justifies its ubiquity: sums of many independent random variables tend to be Gaussian, regardless of their individual distributions.
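The sketch below (parameter values are arbitrary choices for illustration) draws from each of these distributions with NumPy and shows the Central Limit Theorem empirically: means of uniform samples concentrate in a Gaussian-looking bell around 0.5.

```python
import numpy as np

rng = np.random.default_rng(42)

# Five draws from each distribution (parameters are illustrative).
bern  = rng.binomial(n=1, p=0.3, size=5)           # Bernoulli(0.3)
binom = rng.binomial(n=10, p=0.3, size=5)          # Binomial(10, 0.3)
pois  = rng.poisson(lam=4.0, size=5)               # Poisson(4)
unif  = rng.uniform(low=0.0, high=1.0, size=5)     # Uniform(0, 1)
gauss = rng.normal(loc=0.0, scale=1.0, size=5)     # N(0, 1)

# Central Limit Theorem in action: means of 30 uniforms look Gaussian.
sample_means = rng.uniform(0.0, 1.0, size=(10_000, 30)).mean(axis=1)
print(sample_means.mean(), sample_means.std())     # ~ 0.5 and ~ 1/sqrt(12 * 30) ~ 0.053
```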
Expectation, Variance, and Covariance
Expectation (mean): $\mathbb{E}[X] = \sum_x x\,p_X(x)$ for discrete variables, $\mathbb{E}[X] = \int x\,f_X(x)\,dx$ for continuous ones.
Variance: $\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.
Covariance measures linear co-dependence between two variables: $\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$.
The covariance matrix $\Sigma$ for a random vector $\mathbf{x}$ has entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$. It is always symmetric and positive semi-definite, making it amenable to eigendecomposition for PCA.
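A brief NumPy sketch (synthetic data, illustrative only) that estimates a covariance matrix from samples and verifies the symmetry and positive semi-definiteness claims numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1,000 samples of a 3-dimensional random vector with correlated coordinates.
X = rng.normal(size=(1000, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]           # coordinate 2 now depends on coordinate 0

mean = X.mean(axis=0)                             # empirical E[x], near [0, 0, 0]
Sigma = np.cov(X, rowvar=False)                   # empirical covariance matrix (3 x 3)

print(np.round(mean, 2))
print(np.allclose(Sigma, Sigma.T))                # True: symmetric
print(np.linalg.eigvalsh(Sigma).min() >= -1e-10)  # True: positive semi-definite
```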
The Multivariate Gaussian
For a random vector $\mathbf{x} \in \mathbb{R}^n$ with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
The quadratic form $(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ in the exponent is the squared Mahalanobis distance -- a key quantity linking probability to distance metrics.
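A small sketch (illustrative parameter values, assuming scipy is available) that evaluates the multivariate Gaussian density and the squared Mahalanobis distance appearing in its exponent:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import mahalanobis

# Illustrative parameters for a 2-D Gaussian.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, -1.0])

# Density of the multivariate Gaussian at x.
density = stats.multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) from the exponent.
d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

print(density, d2)
print(mahalanobis(x, mu, np.linalg.inv(Sigma)) ** 2)   # same quantity via scipy
```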
Why It Matters
Probability is the language in which ML models express uncertainty. Classification outputs are probabilities. Generative models define probability distributions over data. Loss functions like cross-entropy are derived from probabilistic principles. Without probability, there is no principled way to handle noise, make predictions under uncertainty, or compare models.
Key Technical Details
- Law of total probability: $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ for a partition $\{B_i\}$ of the sample space.
- Linearity of expectation: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$, even if $X$ and $Y$ are dependent.
- $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
- Correlation: $\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$. Zero correlation does not imply independence (except for jointly Gaussian variables).
- The softmax function maps logits $\mathbf{z}$ to a valid probability distribution: $\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (a minimal implementation is sketched below).
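A minimal, numerically stable softmax sketch (illustrative, not from the original text):

```python
import numpy as np

def softmax(z):
    """Map a vector of logits to a probability distribution (numerically stable)."""
    z = z - np.max(z)        # shifting logits leaves softmax unchanged, avoids overflow
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())            # entries in (0, 1) that sum to 1
```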
Common Misconceptions
- "Probability and likelihood are the same." Probability is a function of outcomes given fixed parameters. Likelihood is a function of parameters given fixed data. This distinction is crucial for MLE and Bayesian inference.
- "A PDF value is a probability." The PDF can exceed 1. Only the integral of the PDF over an interval is a probability.
- "Uncorrelated means independent." Uncorrelated means zero linear dependence. Two variables can be uncorrelated yet strongly dependent (e.g., and where is symmetric around 0).
Connections to Other Concepts
- statistical-inference.md: Uses probability distributions to draw conclusions about populations from samples.
- maximum-likelihood-estimation.md: Finds parameters that maximize the probability of observed data under a model.
- information-theory.md: Entropy and KL divergence are defined in terms of probability distributions.
- matrix-decompositions.md: Eigendecomposition of the covariance matrix reveals the principal axes of variation.
- norms-and-distance-metrics.md: The Mahalanobis distance is defined through the inverse covariance matrix.
Further Reading
- Blitzstein & Hwang, Introduction to Probability (2019) -- Clear, example-driven probability textbook with ML-relevant exercises.
- Bishop, Pattern Recognition and Machine Learning, Chapters 1-2 (2006) -- Probability theory presented in the context of ML.
- Jaynes, Probability Theory: The Logic of Science (2003) -- Deep philosophical treatment of probability as extended logic.