One-Line Summary: Fitting a hyperplane to data by minimizing squared errors -- the most interpretable and foundational predictive model.
Prerequisites: Linear algebra (matrix operations, projections), calculus (partial derivatives), probability (expected value, variance), loss functions and optimization basics.
What Is Linear Regression?
Imagine you are an appraiser estimating house prices. You notice that price tends to increase with square footage, number of bedrooms, and neighborhood quality. Linear regression formalizes this intuition: it finds the best-fitting flat surface (a hyperplane) through a cloud of data points so that the overall prediction error is as small as possible.
Formally, we model the relationship between a response variable $y$ and a vector of predictors $x_1, \dots, x_p$ as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

In matrix notation for $n$ observations, this becomes:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $X \in \mathbb{R}^{n \times (p+1)}$ is the design matrix (with a column of ones for the intercept), $\boldsymbol{\beta}$ is the coefficient vector, and $\boldsymbol{\varepsilon}$ is the error vector with $E[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I$.
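To make the matrix notation concrete, here is a minimal NumPy sketch (with arbitrary illustrative values) that assembles a design matrix with an intercept column and generates responses from the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2                                  # observations and predictors (illustrative)
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])  # design matrix: column of ones for the intercept

beta = np.array([4.0, 2.0, -1.0])              # coefficient vector, intercept first (assumed values)
eps = rng.normal(scale=0.5, size=n)            # errors: mean zero, constant variance

y = X @ beta + eps                             # y = X beta + eps
```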
How It Works
The Ordinary Least Squares (OLS) Objective
We seek the coefficient vector $\hat{\boldsymbol{\beta}}$ that minimizes the sum of squared residuals:

$$\text{RSS}(\boldsymbol{\beta}) = \|\mathbf{y} - X\boldsymbol{\beta}\|^2 = \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2$$

Taking the gradient with respect to $\boldsymbol{\beta}$, setting it to zero, and solving yields the normal equations:

$$X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$$

When $X^\top X$ is invertible (i.e., no perfect multicollinearity), the closed-form solution is:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
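A minimal sketch of the closed-form solution on synthetic data (solving the normal equations with `np.linalg.solve` is more numerically stable than forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([4.0, 2.0, -1.0])                       # assumed coefficients for illustration
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [4.0, 2.0, -1.0]
```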
Geometric Interpretation
OLS has an elegant geometric meaning. The fitted values $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $X$. The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to every column of $X$, which is precisely the condition $X^\top \mathbf{e} = \mathbf{0}$ -- the normal equations restated geometrically. The hat matrix $H = X(X^\top X)^{-1}X^\top$ is the projection matrix satisfying $\hat{\mathbf{y}} = H\mathbf{y}$.
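As a quick numerical check of the projection view (a minimal sketch on synthetic data), the hat matrix is idempotent and the residuals are orthogonal to every column of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y                           # fitted values: projection of y onto col(X)
resid = y - y_hat

print(np.allclose(H @ H, H))            # True: H is idempotent (a projection)
print(np.allclose(X.T @ resid, 0))      # True: residuals orthogonal to the columns of X
```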
Gradient Descent Alternative
When $p$ (or $n$) is very large, inverting $X^\top X$ (an $O(p^3)$ operation) becomes impractical. Gradient descent offers an iterative alternative. The gradient of the MSE loss with respect to $\boldsymbol{\beta}$ is:

$$\nabla_{\boldsymbol{\beta}} L = \frac{2}{n} X^\top (X\boldsymbol{\beta} - \mathbf{y})$$

At each step $t$, we update:

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \eta \, \nabla_{\boldsymbol{\beta}} L\big(\boldsymbol{\beta}^{(t)}\big)$$

where $\eta$ is the learning rate. Three major variants exist:
- Batch gradient descent: Uses all observations per update. Exact gradient but expensive per step.
- Stochastic gradient descent (SGD): Uses a single random observation per update. Noisy but extremely fast per step and scales to massive datasets.
- Mini-batch gradient descent: Uses a random subset of observations per update (commonly a few dozen to a few hundred). Balances noise and computational efficiency; the dominant approach in practice.
For linear regression with MSE loss, the loss surface is a convex quadratic (a paraboloid in coefficient space), guaranteeing that gradient descent converges to the global minimum for a sufficiently small $\eta$. The condition number of $X^\top X$ (ratio of largest to smallest eigenvalue) controls convergence speed: ill-conditioned problems converge slowly and benefit from feature scaling.
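A minimal sketch of batch gradient descent for the MSE loss on synthetic data (the learning rate and iteration count are arbitrary illustrative choices; a mini-batch variant would simply sample a subset of rows at each step):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([4.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta = np.zeros(X.shape[1])                  # start at the origin
eta = 0.1                                    # learning rate (illustrative)
for _ in range(2000):
    grad = (2 / n) * X.T @ (X @ beta - y)    # gradient of the MSE loss
    beta -= eta * grad                       # gradient descent update

print(beta)   # converges to the OLS solution for sufficiently small eta
```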
Assumptions of Classical Linear Regression
- Linearity: The true relationship between $y$ and the predictors is linear in the parameters.
- Independence: Observations are independent of one another.
- Homoscedasticity: The error variance is constant: $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$.
- Normality of errors: $\varepsilon_i \sim N(0, \sigma^2)$. This is needed for exact inference (t-tests, F-tests) but not for OLS estimation itself.
- No perfect multicollinearity: No predictor is an exact linear combination of others.
When the first four non-normality assumptions hold (normality is not required), the Gauss-Markov theorem guarantees that OLS is the Best Linear Unbiased Estimator (BLUE) -- it has the smallest variance among all linear unbiased estimators.
R-Squared and Model Fit
The coefficient of determination $R^2$ measures the proportion of variance explained:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}}$$

$R^2$ always increases (or stays the same) when predictors are added, so the adjusted $R^2$ penalizes model complexity:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
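A minimal sketch of computing $R^2$ and adjusted $R^2$ from fitted values (synthetic data; `p` counts predictors excluding the intercept):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)                    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # penalizes model complexity

print(r2, adj_r2)
```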
Practical Example
Predicting house prices from square footage ($x_1$) and number of bedrooms ($x_2$) yields a fitted equation of the form:

$$\widehat{\text{price}} = \hat{\beta}_0 + 150\,x_1 + \hat{\beta}_2\,x_2$$
Interpretation: holding bedrooms constant, each additional square foot adds $150 to the predicted price. This direct interpretability is linear regression's greatest practical strength.
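A hedged sketch of how this might look with scikit-learn, using synthetic house data (the coefficient values, including the 150-per-square-foot figure baked into the simulation, are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 300
sqft = rng.uniform(500, 3500, size=n)
bedrooms = rng.integers(1, 6, size=n)
# Synthetic prices: $150 per square foot plus a bedroom effect and noise (assumed values).
price = 50_000 + 150 * sqft + 10_000 * bedrooms + rng.normal(scale=20_000, size=n)

model = LinearRegression().fit(np.column_stack([sqft, bedrooms]), price)
print(model.intercept_, model.coef_)   # coef_[0] ~ 150: predicted price change per extra square foot
```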
Why It Matters
Linear regression is far more than a toy model. It is the baseline against which all other regression methods are compared. Its closed-form solution makes it computationally cheap and analytically tractable. The interpretability of coefficients -- each $\beta_j$ represents the expected change in $y$ per unit change in $x_j$, holding all else constant -- makes it indispensable in science, economics, and policy. Many advanced techniques (ridge regression, LASSO, generalized linear models) are direct extensions of linear regression.
Beyond prediction, linear regression plays a central role in causal inference. In randomized experiments, regressing the outcome on treatment assignment recovers the average treatment effect. In observational studies, regression adjustment is the most common method for controlling confounders, though this requires strong assumptions about model specification. Understanding when regression coefficients have causal meaning versus merely predictive meaning is one of the most important distinctions in applied statistics.
Key Technical Details
- The OLS estimator is unbiased: $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ under correct specification.
- The variance of the estimator is $\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^\top X)^{-1}$, estimated by replacing $\sigma^2$ with $\hat{\sigma}^2 = \text{RSS}/(n - p - 1)$ (see the sketch after this list).
- Inverting $X^\top X$ costs $O(p^3)$; for large problems, use gradient descent or QR decomposition.
- Adding irrelevant predictors inflates variance without reducing bias (bias-variance tradeoff).
- Standardizing features (zero mean, unit variance) before fitting makes coefficients comparable in magnitude.
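A minimal sketch of the variance formula in practice: coefficient standard errors are the square roots of the diagonal of $\hat{\sigma}^2 (X^\top X)^{-1}$ (synthetic data, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.8, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (n - p - 1)         # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated Var(beta_hat)
std_errors = np.sqrt(np.diag(cov_beta))

print(np.column_stack([beta_hat, std_errors]))   # coefficients with their standard errors
```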
Common Misconceptions
- "A high R-squared means the model is good." can be high due to overfitting, spurious correlations, or irrelevant predictors. Always check residuals and use adjusted or cross-validation.
- "Linear regression assumes y is normally distributed." The normality assumption applies to the errors , not to itself. The distribution of conditional on is what matters.
- "Linear regression can only model straight lines." It models linear relationships in the parameters. You can include polynomial features like or interaction terms like and still use OLS.
- "OLS always has a unique solution." When is singular (perfect multicollinearity), the solution is not unique. Regularization methods like Ridge regression resolve this.
Connections to Other Concepts
- polynomial-regression.md: Extends linear regression by adding powers of predictors as features while retaining the OLS framework.
- ridge-and-lasso-regression.md: Add penalty terms to the OLS objective to combat overfitting and multicollinearity.
- regression-diagnostics.md: The toolkit for verifying whether linear regression assumptions actually hold for your data.
- generalized-linear-models.md: Extend the linear regression framework to non-normal response distributions via link functions.
- optimization-and-gradient-descent.md: The iterative optimization alternative when the closed-form solution is computationally infeasible.
- bias-variance-tradeoff.md: Linear regression is low-bias for truly linear relationships but can have high variance with many correlated predictors.
Further Reading
- Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning" (2009) -- Chapter 3 provides a rigorous treatment of linear methods for regression.
- Shalev-Shwartz and Ben-David, "Understanding Machine Learning" (2014) -- Formalizes linear regression within the PAC learning framework.
- Angrist and Pischke, "Mostly Harmless Econometrics" (2009) -- Explains the causal interpretation of regression coefficients and when it is valid.