One-Line Summary: L2 and L1 penalties that shrink coefficients toward zero -- Ridge for stability, Lasso for sparsity and feature selection.

Prerequisites: Linear regression (OLS, normal equations), matrix algebra, bias-variance tradeoff, cross-validation, norms ($\ell_1$, $\ell_2$).

What Is Regularized Regression?

Suppose you are predicting a patient's blood pressure from 500 genomic markers measured on only 100 patients. OLS fails spectacularly here: with $p > n$, the system is underdetermined and $X^\top X$ is singular. Even when $p < n$, highly correlated predictors inflate coefficient variance, producing unstable and unreliable estimates. Regularization adds a penalty term to the OLS objective that constrains the coefficients, trading a small increase in bias for a large reduction in variance.

Ridge and Lasso regression are the two most important regularized regression methods. They differ in the geometry of their penalty, and this geometric difference has profound consequences for the solutions they produce.

How It Works

Ridge Regression (L2 Penalty)

Ridge regression minimizes the penalized objective:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$

where $\lambda \ge 0$ is the regularization parameter. The closed-form solution is:

$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

The addition of $\lambda I$ to $X^\top X$ ensures invertibility regardless of multicollinearity. As $\lambda \to 0$, the solution approaches OLS; as $\lambda \to \infty$, all coefficients shrink toward zero (but never reach exactly zero).
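
As a quick illustration, here is a minimal NumPy sketch of the closed-form solution (our own example, not from the text; it skips the standardization and intercept handling a real analysis would need):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500                       # more predictors than samples
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)  # only 5 true signals

lam = 10.0
# (X'X + lam*I) is invertible even though X'X has rank at most n < p
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge[:5])   # estimates for the true signals, shrunk toward zero
```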

In terms of the singular value decomposition $X = UDV^\top$, the Ridge estimator shrinks the OLS fit along each principal component direction by a factor of $d_j^2 / (d_j^2 + \lambda)$, where $d_j$ is the $j$-th singular value. Components with small singular values (the directions in which the OLS coefficients have the highest variance) are shrunk the most.
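
A short check of the SVD view (again our own sketch): the closed-form Ridge solution coincides with shrinking each singular direction by $d_j^2 / (d_j^2 + \lambda)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 5, 3.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
# Ridge via SVD: beta = V diag(d / (d^2 + lam)) U'y, i.e. each component
# of the OLS fit scaled by d_j^2 / (d_j^2 + lam)
beta_svd = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_svd, beta_direct))  # True
```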

Lasso Regression (L1 Penalty)

The Lasso (Least Absolute Shrinkage and Selection Operator) replaces the $\ell_2$ penalty with an $\ell_1$ penalty:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$

where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$. Unlike Ridge, the Lasso has no closed-form solution and requires iterative algorithms such as coordinate descent or ISTA (Iterative Shrinkage-Thresholding Algorithm).

The defining property of Lasso is sparsity: for sufficiently large $\lambda$, many coefficients are driven to exactly zero, effectively performing feature selection.
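
Since ISTA is named above, here is a minimal sketch of it, assuming the objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ and a step size of $1/L$ with $L$ the largest eigenvalue of $X^\top X$ (all names and constants are our own choices):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    """ISTA for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)        # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(100)
print(np.nonzero(ista(X, y, lam=50.0))[0])  # mostly the first three indices
```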

Geometric Interpretation

The regularization objectives can be rewritten as constrained optimization problems:

  • Ridge: Minimize $\|y - X\beta\|_2^2$ subject to $\|\beta\|_2^2 \le t$
  • Lasso: Minimize $\|y - X\beta\|_2^2$ subject to $\|\beta\|_1 \le t$

Geometrically, the OLS solution lies at the center of elliptical contours of the loss function. The constraint region for Ridge is a sphere (the $\ell_2$ ball), while for Lasso it is a diamond (the $\ell_1$ ball). The solution is where the elliptical contours first touch the constraint region.

The diamond shape of the $\ell_1$ ball has corners aligned with the coordinate axes. The expanding elliptical contours are far more likely to first make contact at a corner of the diamond than at a generic point on its surface. At a corner, one or more coordinates are exactly zero -- this is the geometric reason Lasso produces sparse solutions. The sphere, having no corners, almost never yields exact zeros.

Elastic Net

The Elastic Net combines both penalties:

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$

Or equivalently, with a mixing parameter $\alpha \in [0, 1]$:

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \left( \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right)$$
Elastic Net inherits Lasso's sparsity while resolving its tendency to arbitrarily select one variable from a group of highly correlated predictors. When predictors come in correlated groups, Elastic Net tends to select or exclude the entire group together.
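
A brief scikit-learn sketch of the grouping effect on constructed data (`l1_ratio` is scikit-learn's name for the mixing parameter; the exact coefficients will vary with the data):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
n = 200
z = rng.standard_normal(n)
# three nearly identical predictors plus five unrelated noise columns
cols = [z + 0.01 * rng.standard_normal(n) for _ in range(3)]
cols += [rng.standard_normal(n) for _ in range(5)]
X = np.column_stack(cols)
y = 3 * z + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print("Lasso:", np.round(lasso.coef_[:3], 2))  # tends to concentrate on one column
print("ENet: ", np.round(enet.coef_[:3], 2))   # tends to spread weight across the group
```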

The Regularization Path

As $\lambda$ varies from $\infty$ down to $0$, the coefficients trace out a regularization path. For Ridge, each coefficient shrinks smoothly toward zero. For Lasso, the path is piecewise linear: coefficients enter or leave the active set at discrete values of $\lambda$. The LARS (Least Angle Regression) algorithm computes the entire Lasso path at essentially the cost of a single OLS fit.
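
The path can be inspected directly; a sketch using scikit-learn's `lasso_path`, which computes the path by coordinate descent over a grid of $\lambda$ values rather than by LARS:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 10))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(100)

# alphas play the role of lambda; they are returned from largest (all-zero
# model) down to smallest, and the active set grows as lambda decreases
alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
for a, c in zip(alphas, coefs.T):
    print(f"lambda={a:8.4f}  active={np.count_nonzero(c)}")
```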

Coordinate Descent for Lasso

The most widely used algorithm for Lasso is coordinate descent. It cycles through each coefficient and applies the soft-thresholding operator. For the objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$, the update for coefficient $j$ is:

$$\beta_j \leftarrow \frac{S(\rho_j, \lambda)}{\sum_{i=1}^{n} x_{ij}^2}, \qquad \rho_j = \sum_{i=1}^{n} x_{ij} \left( y_i - \hat{y}_i^{(-j)} \right)$$

where $\hat{y}_i^{(-j)} = \sum_{k \ne j} x_{ik} \beta_k$ is the prediction excluding predictor $j$, and the soft-thresholding operator is:

$$S(z, \lambda) = \operatorname{sign}(z) \max(|z| - \lambda, 0)$$

When $|\rho_j| \le \lambda$, the coefficient is set exactly to zero. This is the mechanism by which Lasso achieves sparsity algorithmically. Coordinate descent is fast because each step has a closed-form solution, and it exploits sparsity by skipping zero coefficients.
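
A minimal coordinate-descent sketch for the update above (our own implementation: plain cyclic sweeps with a fixed iteration budget, none of the active-set or screening tricks production solvers use):

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()                      # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0)         # sum_i x_ij^2 for each column
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]    # remove predictor j: partial residual
            rho = X[:, j] @ resid         # rho_j = x_j . (y - yhat^{(-j)})
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]    # restore predictor j's contribution
    return beta

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(100)
print(np.nonzero(lasso_cd(X, y, lam=50.0))[0])  # sparse: mostly indices 0, 1, 2
```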

Choosing Lambda via Cross-Validation

The standard practice is:

  1. Define a grid of $\lambda$ values (typically on a log scale, from $\lambda_{\max}$, where all Lasso coefficients are zero, down to a small fraction of $\lambda_{\max}$).
  2. For each $\lambda$, perform $K$-fold cross-validation and record the mean CV error.
  3. Select $\lambda_{\min}$ (minimizing CV error) or $\lambda_{1\text{se}}$ (the largest $\lambda$ within one standard error of the minimum, preferring simpler models).

The value $\lambda_{\max} = \max_j |x_j^\top y|$ (for centered data, under the objective above) is the smallest $\lambda$ for which all Lasso coefficients are zero, providing a natural upper bound for the search grid.
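
A sketch of this recipe with scikit-learn's `LassoCV`. Note that scikit-learn scales its objective by $1/(2n)$, so its $\lambda_{\max}$ is $\max_j |x_j^\top y| / n$, and `LassoCV` selects the error-minimizing $\lambda$ (it has no built-in one-standard-error rule):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(200)

# lambda_max under sklearn's (1/2n)||y - Xb||^2 + alpha*||b||_1 scaling
Xc, yc = X - X.mean(axis=0), y - y.mean()
lam_max = np.max(np.abs(Xc.T @ yc)) / len(y)

# log-spaced grid from lambda_max down to lambda_max / 1000
grid = np.geomspace(lam_max, lam_max * 1e-3, 100)
cv = LassoCV(cv=5, alphas=grid).fit(X, y)
print("chosen lambda:", cv.alpha_)
print("nonzero coefs:", np.count_nonzero(cv.coef_))
```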

Practical Example

Predicting gene expression from 1000 SNPs (genetic variants) on 200 samples. OLS is impossible ($p = 1000 > n = 200$). Ridge regression produces a model using all 1000 SNPs with small but nonzero coefficients. Lasso selects 15 SNPs with nonzero coefficients, identifying a sparse set of putatively causal variants. The biologist can then focus experiments on those 15 variants.
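
A simulation in the same spirit, with synthetic data standing in for the SNP matrix (the exact nonzero counts depend on the seed and the chosen $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

rng = np.random.default_rng(7)
n, p, k = 200, 1000, 15                   # 200 samples, 1000 SNPs, 15 causal
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:k] = rng.uniform(1.0, 2.0, k)
y = X @ beta_true + rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)
print("Ridge nonzero:", np.count_nonzero(ridge.coef_))  # all 1000, small values
print("Lasso nonzero:", np.count_nonzero(lasso.coef_))  # a sparse subset
```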

Why It Matters

Regularization is arguably the single most important idea in modern supervised learning. It provides a principled mechanism for controlling model complexity, preventing overfitting, and handling the $p \gg n$ regime that is ubiquitous in genomics, text analysis, and high-dimensional data. Ridge and Lasso are the workhorses of regularized regression and serve as the foundation for understanding more complex penalized models.

The practical impact is enormous. In genomics, Lasso enables genome-wide association studies to identify disease-associated genes from millions of candidate variants. In natural language processing, regularized regression handles sparse bag-of-words features with vocabularies exceeding 100,000 terms. In finance, Ridge regression stabilizes portfolio optimization when the covariance matrix is estimated from limited data. The core idea -- penalizing complexity -- extends far beyond linear models to neural networks (weight decay is $\ell_2$ regularization), support vector machines, and virtually every modern machine learning method.

Key Technical Details

  • Always standardize features before applying Ridge or Lasso so the penalty treats all coefficients equally (see the pipeline sketch after this list). The intercept is typically not penalized.
  • Ridge has a Bayesian interpretation: it corresponds to the posterior mode under a Gaussian prior $\beta_j \sim \mathcal{N}(0, \tau^2)$ on the coefficients. Lasso corresponds to a Laplace prior $p(\beta_j) \propto \exp(-|\beta_j| / b)$.
  • Lasso is inconsistent for variable selection unless the irrepresentable condition holds (a constraint on the correlation structure of $X$).
  • The degrees of freedom for Ridge regression are $\mathrm{df}(\lambda) = \sum_{j=1}^{p} d_j^2 / (d_j^2 + \lambda)$, where $d_j$ are the singular values of $X$.
  • Computational cost: Ridge is $O(np^2 + p^3)$ via the closed form; Lasso via coordinate descent is $O(np)$ per full sweep, typically converging in a few sweeps.
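
As promised above, a sketch of the standardization advice as a scikit-learn pipeline: the scaler is fit inside each CV fold, which avoids leaking test-fold statistics, and the intercept is left unpenalized via `fit_intercept` (the default):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
# columns on wildly different scales, so an unstandardized penalty
# would punish some coefficients far more than others
X = rng.standard_normal((150, 30)) * rng.uniform(0.1, 100, 30)
y = 0.05 * X[:, 0] + rng.standard_normal(150)

# StandardScaler is refit on each training fold before Lasso sees the data
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
print(cross_val_score(model, X, y, cv=5).mean())
```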

Common Misconceptions

  • "Ridge and Lasso are fundamentally different methods." Both are penalized least squares; they differ only in the norm used for the penalty. This single geometric difference (sphere vs. diamond) drives all downstream differences.
  • "Lasso always outperforms Ridge." When the true model is dense (many small nonzero coefficients), Ridge typically outperforms Lasso. Lasso excels when the truth is sparse.
  • "Setting lambda to zero gives the best possible model." Zero regularization recovers OLS, which overfits in high dimensions. Some nonzero almost always improves out-of-sample performance.
  • "Lasso selects the correct variables." Lasso can miss relevant variables or select irrelevant ones, especially with correlated predictors. It is a useful heuristic for feature selection, not an oracle.

Connections to Other Concepts

  • linear-regression.md: Ridge and Lasso generalize OLS by adding a penalty term; they reduce to OLS when $\lambda = 0$.
  • polynomial-regression.md: Regularization is critical when using high-degree polynomial features to prevent the coefficient explosion that causes overfitting.
  • bias-variance-tradeoff.md: $\lambda$ directly controls the bias-variance balance. Increasing $\lambda$ increases bias but decreases variance.
  • regression-diagnostics.md: Multicollinearity (high VIF) is a signal that Ridge or Lasso may improve over OLS.
  • cross-validation.md: The standard method for selecting the regularization strength $\lambda$.
  • generalized-linear-models.md: Regularization extends naturally to GLMs, yielding penalized logistic regression and penalized Poisson regression.

Further Reading

  • Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning" (2009) -- Chapter 3 provides comprehensive coverage of Ridge, Lasso, and Elastic Net.
  • Tibshirani, "Regression Shrinkage and Selection via the Lasso" (1996) -- The original Lasso paper; foundational reading.
  • Zou and Hastie, "Regularization and Variable Selection via the Elastic Net" (2005) -- Introduces the Elastic Net and explains when it is preferred over Lasso.
  • Efron et al., "Least Angle Regression" (2004) -- Describes the LARS algorithm for computing the full Lasso path efficiently.