One-Line Summary: Fitting a hyperplane to data by minimizing squared errors -- the most interpretable and foundational predictive model.
Prerequisites: Linear algebra (matrix operations, projections), calculus (partial derivatives), probability (expected value, variance), loss functions and optimization basics.
What Is Linear Regression?
Imagine you are an appraiser estimating house prices. You notice that price tends to increase with square footage, number of bedrooms, and neighborhood quality. Linear regression formalizes this intuition: it finds the best-fitting flat surface (a hyperplane) through a cloud of data points so that the overall prediction error is as small as possible.
Formally, we model the relationship between a response variable $y$ and a vector of predictors $x_1, \dots, x_p$ as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

In matrix notation for $n$ observations, this becomes:

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $X \in \mathbb{R}^{n \times (p+1)}$ is the design matrix (with a column of ones for the intercept), $\boldsymbol{\beta}$ is the coefficient vector, and $\boldsymbol{\varepsilon}$ is the error vector with $E[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I$.
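To make the matrix notation concrete, here is a minimal NumPy sketch (with arbitrary illustrative values) that assembles a design matrix with an intercept column and generates responses from the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2                                  # observations and predictors (illustrative)
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])  # design matrix: column of ones for the intercept

beta = np.array([4.0, 2.0, -1.0])              # coefficient vector, intercept first (assumed values)
eps = rng.normal(scale=0.5, size=n)            # errors: mean zero, constant variance

y = X @ beta + eps                             # y = X beta + eps
```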
How It Works
The Ordinary Least Squares (OLS) Objective
We seek the coefficient vector $\hat{\boldsymbol{\beta}}$ that minimizes the sum of squared residuals:

$$\text{RSS}(\boldsymbol{\beta}) = \|\mathbf{y} - X\boldsymbol{\beta}\|^2 = \sum_{i=1}^{n} \left(y_i - \mathbf{x}_i^\top \boldsymbol{\beta}\right)^2$$

Taking the gradient with respect to $\boldsymbol{\beta}$, setting it to zero, and solving yields the normal equations:

$$X^\top X \hat{\boldsymbol{\beta}} = X^\top \mathbf{y}$$

When $X^\top X$ is invertible (i.e., no perfect multicollinearity), the closed-form solution is:

$$\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
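A minimal sketch of the closed-form solution on synthetic data (solving the normal equations with `np.linalg.solve` is more numerically stable than forming the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([4.0, 2.0, -1.0])                       # assumed coefficients for illustration
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form OLS: solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [4.0, 2.0, -1.0]
```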
Geometric Interpretation
OLS has an elegant geometric meaning. The fitted values $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $X$. The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to every column of $X$, which is precisely the condition $X^\top \mathbf{e} = \mathbf{0}$ -- the normal equations restated geometrically. The hat matrix $H = X(X^\top X)^{-1}X^\top$ is the projection matrix satisfying $\hat{\mathbf{y}} = H\mathbf{y}$.
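As a quick numerical check of the projection view (a minimal sketch on synthetic data), the hat matrix is idempotent and the residuals are orthogonal to every column of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X^T X)^{-1} X^T
y_hat = H @ y                           # fitted values: projection of y onto col(X)
resid = y - y_hat

print(np.allclose(H @ H, H))            # True: H is idempotent (a projection)
print(np.allclose(X.T @ resid, 0))      # True: residuals orthogonal to the columns of X
```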
Gradient Descent Alternative
When $p$ (or $n$) is very large, inverting $X^\top X$ (an $O(p^3)$ operation) becomes impractical. Gradient descent offers an iterative alternative. The gradient of the MSE loss with respect to $\boldsymbol{\beta}$ is:

$$\nabla_{\boldsymbol{\beta}} L = \frac{2}{n} X^\top (X\boldsymbol{\beta} - \mathbf{y})$$

At each step $t$, we update:

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \eta \, \nabla_{\boldsymbol{\beta}} L\big(\boldsymbol{\beta}^{(t)}\big)$$

where $\eta$ is the learning rate. Three major variants exist:
- Batch gradient descent: Uses all observations per update. Exact gradient but expensive per step.
- Stochastic gradient descent (SGD): Uses a single random observation per update. Noisy but extremely fast per step and scales to massive datasets.
- Mini-batch gradient descent: Uses a random subset of observations per update (commonly a few dozen to a few hundred). Balances noise and computational efficiency; the dominant approach in practice.
For linear regression with MSE loss, the loss surface is a convex quadratic (a paraboloid in coefficient space), guaranteeing that gradient descent converges to the global minimum for a sufficiently small $\eta$. The condition number of $X^\top X$ (ratio of largest to smallest eigenvalue) controls convergence speed: ill-conditioned problems converge slowly and benefit from feature scaling.
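A minimal sketch of batch gradient descent for the MSE loss on synthetic data (the learning rate and iteration count are arbitrary illustrative choices; a mini-batch variant would simply sample a subset of rows at each step):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([4.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta = np.zeros(X.shape[1])                  # start at the origin
eta = 0.1                                    # learning rate (illustrative)
for _ in range(2000):
    grad = (2 / n) * X.T @ (X @ beta - y)    # gradient of the MSE loss
    beta -= eta * grad                       # gradient descent update

print(beta)   # converges to the OLS solution for sufficiently small eta
```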
Assumptions of Classical Linear Regression
- Linearity: The true relationship between $y$ and the predictors is linear in the parameters.
- Independence: Observations are independent of one another.
- Homoscedasticity: The error variance is constant: $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all $i$.
- Normality of errors: $\varepsilon_i \sim N(0, \sigma^2)$. This is needed for exact inference (t-tests, F-tests) but not for OLS estimation itself.
- No perfect multicollinearity: No predictor is an exact linear combination of others.
When the first four non-normality assumptions hold (normality is not required), the Gauss-Markov theorem guarantees that OLS is the Best Linear Unbiased Estimator (BLUE) -- it has the smallest variance among all linear unbiased estimators.
R-Squared and Model Fit
The coefficient of determination $R^2$ measures the proportion of variance explained:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}}$$

$R^2$ always increases (or stays the same) when predictors are added, so the adjusted $R^2$ penalizes model complexity:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
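A minimal sketch of computing $R^2$ and adjusted $R^2$ from fitted values (synthetic data; `p` counts predictors excluding the intercept):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)                    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # penalizes model complexity

print(r2, adj_r2)
```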
Practical Example
Predicting house prices from square footage ($x_1$) and number of bedrooms ($x_2$) yields a fitted equation of the form:

$$\widehat{\text{price}} = \hat{\beta}_0 + 150\,x_1 + \hat{\beta}_2\,x_2$$
Interpretation: holding bedrooms constant, each additional square foot adds $150 to the predicted price. This direct interpretability is linear regression's greatest practical strength.
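A hedged sketch of how this might look with scikit-learn, using synthetic house data (the coefficient values, including the 150-per-square-foot figure baked into the simulation, are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 300
sqft = rng.uniform(500, 3500, size=n)
bedrooms = rng.integers(1, 6, size=n)
# Synthetic prices: $150 per square foot plus a bedroom effect and noise (assumed values).
price = 50_000 + 150 * sqft + 10_000 * bedrooms + rng.normal(scale=20_000, size=n)

model = LinearRegression().fit(np.column_stack([sqft, bedrooms]), price)
print(model.intercept_, model.coef_)   # coef_[0] ~ 150: predicted price change per extra square foot
```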
Why It Matters
Linear regression is far more than a toy model. It is the baseline against which all other regression methods are compared. Its closed-form solution makes it computationally cheap and analytically tractable. The interpretability of coefficients -- each $\beta_j$ represents the expected change in $y$ per unit change in $x_j$, holding all else constant -- makes it indispensable in science, economics, and policy. Many advanced techniques (ridge regression, LASSO, generalized linear models) are direct extensions of linear regression.
Beyond prediction, linear regression plays a central role in causal inference. In randomized experiments, regressing the outcome on treatment assignment recovers the average treatment effect. In observational studies, regression adjustment is the most common method for controlling confounders, though this requires strong assumptions about model specification. Understanding when regression coefficients have causal meaning versus merely predictive meaning is one of the most important distinctions in applied statistics.
Key Technical Details
- The OLS estimator is unbiased: $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ under correct specification.
- The variance of the estimator is $\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^\top X)^{-1}$, estimated by replacing $\sigma^2$ with $\hat{\sigma}^2 = \text{RSS}/(n - p - 1)$ (see the sketch after this list).
- Inverting $X^\top X$ costs $O(p^3)$; for large problems, use gradient descent or QR decomposition.
- Adding irrelevant predictors inflates variance without reducing bias (bias-variance tradeoff).
- Standardizing features (zero mean, unit variance) before fitting makes coefficients comparable in magnitude.
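A minimal sketch of the variance formula in practice: coefficient standard errors are the square roots of the diagonal of $\hat{\sigma}^2 (X^\top X)^{-1}$ (synthetic data, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.8, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

sigma2_hat = resid @ resid / (n - p - 1)         # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated Var(beta_hat)
std_errors = np.sqrt(np.diag(cov_beta))

print(np.column_stack([beta_hat, std_errors]))   # coefficients with their standard errors
```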
Common Misconceptions
- "A high R-squared means the model is good." can be high due to overfitting, spurious correlations, or irrelevant predictors. Always check residuals and use adjusted or cross-validation.
- "Linear regression assumes y is normally distributed." The normality assumption applies to the errors , not to itself. The distribution of conditional on is what matters.
- "Linear regression can only model straight lines." It models linear relationships in the parameters. You can include polynomial features like or interaction terms like and still use OLS.
- "OLS always has a unique solution." When is singular (perfect multicollinearity), the solution is not unique. Regularization methods like Ridge regression resolve this.
Connections to Other Concepts
- polynomial-regression.md: Extends linear regression by adding powers of predictors as features while retaining the OLS framework.
- ridge-and-lasso-regression.md: Add penalty terms to the OLS objective to combat overfitting and multicollinearity.
- regression-diagnostics.md: The toolkit for verifying whether linear regression assumptions actually hold for your data.
- generalized-linear-models.md: Extend the linear regression framework to non-normal response distributions via link functions.
- optimization-and-gradient-descent.md: The iterative optimization alternative when the closed-form solution is computationally infeasible.
- bias-variance-tradeoff.md: Linear regression is low-bias for truly linear relationships but can have high variance with many correlated predictors.
Further Reading
- Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning" (2009) -- Chapter 3 provides a rigorous treatment of linear methods for regression.
- Shalev-Shwartz and Ben-David, "Understanding Machine Learning" (2014) -- Formalizes linear regression within the PAC learning framework.
- Angrist and Pischke, "Mostly Harmless Econometrics" (2009) -- Explains the causal interpretation of regression coefficients and when it is valid.