One-Line Summary: Random variables, distributions, Bayes' theorem, and conditional probability -- the language of uncertainty in ML.
Prerequisites: Basic set theory, Vectors and Matrices.
What Is Probability?
Uncertainty is inescapable in machine learning. Training data is a noisy sample from a larger population. Predictions are inherently uncertain. Models must quantify how confident they are. Probability theory provides the rigorous mathematical framework for reasoning about uncertainty.
Think of probability as a way to assign a number between 0 and 1 to events, where 0 means impossible and 1 means certain. If you flip a fair coin, the probability of heads is 0.5 -- not because the outcome is inherently random (physics could predict it) but because your information about the outcome is incomplete. This view of probability as quantified incomplete information connects naturally to the Bayesian perspective used throughout modern ML.
How It Works
Sample Spaces, Events, and Axioms
A sample space $\Omega$ is the set of all possible outcomes. An event is a subset of outcomes. A probability measure $P$ assigns values to events satisfying Kolmogorov's axioms:
- $P(A) \geq 0$ for all events $A$, and $P(\Omega) = 1$
- For mutually exclusive events $A_1, A_2, \ldots$: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$
From these axioms, all of probability theory follows.
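As a quick illustration (not from the original text), here is a minimal sketch of a finite probability space in Python -- a fair six-sided die with the measure stored as a dict -- and the axioms checked numerically:

```python
# A minimal sketch (illustrative): a finite probability space for a fair die.
P = {outcome: 1 / 6 for outcome in range(1, 7)}   # sample space {1, ..., 6}

def prob(event):
    """Probability of an event, i.e. a set of outcomes."""
    return sum(P[o] for o in event)

assert all(p >= 0 for p in P.values())                    # non-negativity
assert abs(prob(set(P)) - 1.0) < 1e-12                    # P(sample space) = 1
A, B = {1, 2}, {5, 6}                                     # mutually exclusive events
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12     # additivity
```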
Conditional Probability and Independence
The conditional probability of $A$ given that $B$ has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$

Events $A$ and $B$ are independent if $P(A \cap B) = P(A)\,P(B)$, or equivalently $P(A \mid B) = P(A)$. In ML, feature independence is a strong assumption (used in Naive Bayes) that rarely holds exactly but often works surprisingly well.
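To make the definitions concrete, here is a small Monte Carlo sketch (illustrative, not from the original text) that estimates $P(A \mid B)$ from simulated binary events and checks the independence criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two dependent binary events: B agrees with A 80% of the time.
A = rng.random(n) < 0.3
B = np.where(rng.random(n) < 0.8, A, ~A)

p_A, p_B = A.mean(), B.mean()
p_A_and_B = (A & B).mean()
p_A_given_B = p_A_and_B / p_B            # P(A|B) = P(A and B) / P(B)

print(f"P(A)      ~ {p_A:.3f}")
print(f"P(A|B)    ~ {p_A_given_B:.3f}   (differs from P(A): A and B are dependent)")
print(f"P(A)P(B)  ~ {p_A * p_B:.3f}  vs  P(A and B) ~ {p_A_and_B:.3f}")
```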
Bayes' Theorem
Bayes' theorem reverses conditioning:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In ML language: posterior $\propto$ likelihood $\times$ prior. Writing $\theta$ for parameters and $D$ for data, this is the foundation of Bayesian inference:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

- $P(\theta)$: prior belief about parameters before seeing data
- $P(D \mid \theta)$: likelihood of observing data given parameters
- $P(\theta \mid D)$: posterior belief after observing data
- $P(D)$: marginal likelihood (evidence)
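A standard worked example (the numbers are illustrative assumptions, not from the original text): a rare condition and an imperfect test, showing how the posterior combines prior and likelihood.

```python
# Hypothetical diagnostic-test example; 1% prior, 95% sensitivity,
# 5% false-positive rate are assumed values for illustration.
prior = 0.01                 # P(disease)
sensitivity = 0.95           # P(positive | disease)
false_positive = 0.05        # P(positive | no disease)

# Evidence via the law of total probability: P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prior / evidence
print(f"P(disease | positive) ~ {posterior:.3f}")   # ~ 0.161: still fairly unlikely
```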
Random Variables
A random variable is a function from the sample space to real numbers: $X : \Omega \to \mathbb{R}$.
Discrete random variables take countably many values. They are characterized by a probability mass function (PMF): $p_X(x) = P(X = x)$.
Continuous random variables take values in intervals. They are characterized by a probability density function (PDF) $f_X$ where:

$$P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$$

Note that $f_X(x)$ itself is not a probability -- it can exceed 1. Only integrals of the PDF over intervals yield probabilities.
The cumulative distribution function (CDF) $F_X(x) = P(X \leq x)$ works for both discrete and continuous variables and is always non-decreasing from 0 to 1.
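A short sketch using scipy.stats (an assumed dependency, chosen here for illustration) that contrasts densities with probabilities and shows the CDF at work:

```python
from scipy import stats

# A density is not a probability: a narrow Gaussian has pdf values above 1.
narrow = stats.norm(loc=0.0, scale=0.1)
print(narrow.pdf(0.0))                      # ~ 3.99 -- legal for a density

# Probabilities come from integrating the density, i.e. from the CDF.
print(narrow.cdf(0.1) - narrow.cdf(-0.1))   # P(-0.1 <= X <= 0.1) ~ 0.683

# Discrete case: PMF values are probabilities and sum to 1.
coin = stats.bernoulli(0.5)
print(coin.pmf(1) + coin.pmf(0))            # 1.0
```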
Common Distributions
Bernoulli($p$): Binary outcome with $P(X = 1) = p$ and $P(X = 0) = 1 - p$. Models coin flips, click/no-click, spam/not-spam.
Binomial($n, p$): Number of successes in $n$ independent Bernoulli trials. $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$.
Poisson($\lambda$): Count of events in a fixed interval. $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$. Models rare events (website visits, hardware failures).
Uniform($a, b$): $f(x) = \frac{1}{b - a}$ for $x \in [a, b]$. Maximum ignorance about where a value falls in an interval.
Gaussian (Normal) $\mathcal{N}(\mu, \sigma^2)$:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
The most important distribution in ML. The Central Limit Theorem justifies its ubiquity: sums of many independent random variables tend to be Gaussian, regardless of their individual distributions.
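The sketch below (parameter values are arbitrary choices for illustration) draws from each of these distributions with NumPy and shows the Central Limit Theorem empirically: means of uniform samples concentrate in a Gaussian-looking bell around 0.5.

```python
import numpy as np

rng = np.random.default_rng(42)

# Five draws from each distribution (parameters are illustrative).
bern  = rng.binomial(n=1, p=0.3, size=5)           # Bernoulli(0.3)
binom = rng.binomial(n=10, p=0.3, size=5)          # Binomial(10, 0.3)
pois  = rng.poisson(lam=4.0, size=5)               # Poisson(4)
unif  = rng.uniform(low=0.0, high=1.0, size=5)     # Uniform(0, 1)
gauss = rng.normal(loc=0.0, scale=1.0, size=5)     # N(0, 1)

# Central Limit Theorem in action: means of 30 uniforms look Gaussian.
sample_means = rng.uniform(0.0, 1.0, size=(10_000, 30)).mean(axis=1)
print(sample_means.mean(), sample_means.std())     # ~ 0.5 and ~ 1/sqrt(12 * 30) ~ 0.053
```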
Expectation, Variance, and Covariance
Expectation (mean): $\mathbb{E}[X] = \sum_x x\,p_X(x)$ for discrete variables, $\mathbb{E}[X] = \int x\,f_X(x)\,dx$ for continuous ones.
Variance: $\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.
Covariance measures linear co-dependence between two variables: $\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$.
The covariance matrix $\Sigma$ for a random vector $\mathbf{x}$ has entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$. It is always symmetric and positive semi-definite, making it amenable to eigendecomposition for PCA.
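A brief NumPy sketch (synthetic data, illustrative only) that estimates a covariance matrix from samples and verifies the symmetry and positive semi-definiteness claims numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

# 1,000 samples of a 3-dimensional random vector with correlated coordinates.
X = rng.normal(size=(1000, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]           # coordinate 2 now depends on coordinate 0

mean = X.mean(axis=0)                             # empirical E[x], near [0, 0, 0]
Sigma = np.cov(X, rowvar=False)                   # empirical covariance matrix (3 x 3)

print(np.round(mean, 2))
print(np.allclose(Sigma, Sigma.T))                # True: symmetric
print(np.linalg.eigvalsh(Sigma).min() >= -1e-10)  # True: positive semi-definite
```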
The Multivariate Gaussian
For a random vector $\mathbf{x} \in \mathbb{R}^n$ with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$:

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
The quadratic form $(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ in the exponent is the squared Mahalanobis distance -- a key quantity linking probability to distance metrics.
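A small sketch (illustrative parameter values, assuming scipy is available) that evaluates the multivariate Gaussian density and the squared Mahalanobis distance appearing in its exponent:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import mahalanobis

# Illustrative parameters for a 2-D Gaussian.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, -1.0])

# Density of the multivariate Gaussian at x.
density = stats.multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) from the exponent.
d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

print(density, d2)
print(mahalanobis(x, mu, np.linalg.inv(Sigma)) ** 2)   # same quantity via scipy
```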
Why It Matters
Probability is the language in which ML models express uncertainty. Classification outputs are probabilities. Generative models define probability distributions over data. Loss functions like cross-entropy are derived from probabilistic principles. Without probability, there is no principled way to handle noise, make predictions under uncertainty, or compare models.
Key Technical Details
- Law of total probability: $P(A) = \sum_i P(A \mid B_i)\,P(B_i)$ for a partition $\{B_i\}$ of the sample space.
- Linearity of expectation: $\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$, even if $X$ and $Y$ are dependent.
- $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$.
- Correlation: $\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$. Zero correlation does not imply independence (except for jointly Gaussian variables).
- The softmax function maps logits $\mathbf{z}$ to a valid probability distribution: $\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (a minimal implementation is sketched below).
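A minimal, numerically stable softmax sketch (illustrative, not from the original text):

```python
import numpy as np

def softmax(z):
    """Map a vector of logits to a probability distribution (numerically stable)."""
    z = z - np.max(z)        # shifting logits leaves softmax unchanged, avoids overflow
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())            # entries in (0, 1) that sum to 1
```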
Common Misconceptions
- "Probability and likelihood are the same." Probability is a function of outcomes given fixed parameters. Likelihood is a function of parameters given fixed data. This distinction is crucial for MLE and Bayesian inference.
- "A PDF value is a probability." The PDF can exceed 1. Only the integral of the PDF over an interval is a probability.
- "Uncorrelated means independent." Uncorrelated means zero linear dependence. Two variables can be uncorrelated yet strongly dependent (e.g., and where is symmetric around 0).
Connections to Other Concepts
- statistical-inference.md: Uses probability distributions to draw conclusions about populations from samples.
- maximum-likelihood-estimation.md: Finds parameters that maximize the probability of observed data under a model.
- information-theory.md: Entropy and KL divergence are defined in terms of probability distributions.
- matrix-decompositions.md: Eigendecomposition of the covariance matrix reveals the principal axes of variation.
- norms-and-distance-metrics.md: The Mahalanobis distance is defined through the inverse covariance matrix.
Further Reading
- Blitzstein & Hwang, Introduction to Probability (2019) -- Clear, example-driven probability textbook with ML-relevant exercises.
- Bishop, Pattern Recognition and Machine Learning, Chapters 1-2 (2006) -- Probability theory presented in the context of ML.
- Jaynes, Probability Theory: The Logic of Science (2003) -- Deep philosophical treatment of probability as extended logic.