One-Line Summary: Extending linear regression to non-normal responses via link functions -- unifying logistic, Poisson, and other regression types.

Prerequisites: Linear regression (OLS, assumptions), probability distributions (Bernoulli, Poisson, exponential family), maximum likelihood estimation, calculus (chain rule, Newton's method).

What Is a Generalized Linear Model?

Ordinary linear regression assumes that the response variable is continuous and normally distributed around its mean. But what if you are predicting whether a customer will churn (binary outcome), how many accidents occur at an intersection per year (count data), or how long until a machine fails (positive continuous)? Forcing these responses into a standard linear regression violates fundamental assumptions: binary data is not Gaussian, counts cannot be negative, and durations are not symmetric.

Generalized linear models (GLMs) extend linear regression to handle all of these situations within a single unified framework. The key idea is elegant: instead of modeling the response mean directly as a linear function of predictors, GLMs model a transformation of the mean (the link function) as linear, while allowing the response to follow any distribution from the exponential family.

Think of it this way: linear regression draws a straight line through the data. GLMs draw a straight line through a transformed version of the data, then invert the transformation to produce predictions on the original scale.

How It Works

The Three Components of a GLM

Every GLM is specified by three components:

  1. Random Component: The response follows a distribution from the exponential family:

     $$f(y; \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right)$$

     where $\theta$ is the natural (canonical) parameter, $\phi$ is a dispersion parameter, and $a(\cdot)$, $b(\cdot)$, $c(\cdot)$ are known functions defining the specific distribution. The mean and variance are:

     $$E[Y] = \mu = b'(\theta), \qquad \mathrm{Var}(Y) = b''(\theta)\, a(\phi)$$

  2. Systematic Component: A linear predictor formed from the covariates:

     $$\eta = \mathbf{x}^\top \boldsymbol{\beta} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

  3. Link Function: A monotonic, differentiable function $g$ that connects the mean to the linear predictor:

     $$g(\mu) = \eta$$

Exponential Family Distributions

The exponential family includes many common distributions:

| Distribution | Support | Natural Parameter $\theta$ | Canonical Link |
| --- | --- | --- | --- |
| Normal | $(-\infty, \infty)$ | $\mu$ | Identity: $g(\mu) = \mu$ |
| Bernoulli | $\{0, 1\}$ | $\log\frac{\mu}{1-\mu}$ | Logit: $g(\mu) = \log\frac{\mu}{1-\mu}$ |
| Poisson | $\{0, 1, 2, \ldots\}$ | $\log \mu$ | Log: $g(\mu) = \log \mu$ |
| Gamma | $(0, \infty)$ | $-1/\mu$ | Inverse: $g(\mu) = 1/\mu$ |

The canonical link function sets $g(\mu) = \theta$, linking the linear predictor directly to the natural parameter ($\eta = \theta$). Using the canonical link simplifies estimation and yields desirable statistical properties, but non-canonical links can also be used.

Logistic Regression as a GLM

Binary classification via logistic regression is a GLM with:

  • Random component: $Y \sim \mathrm{Bernoulli}(p)$, so $\mu = p$
  • Link function: logit, $g(p) = \log \frac{p}{1-p}$
  • Model: $\log \frac{p}{1-p} = \mathbf{x}^\top \boldsymbol{\beta}$

The inverse link gives the predicted probability: $p = g^{-1}(\eta) = \dfrac{1}{1 + e^{-\mathbf{x}^\top \boldsymbol{\beta}}}$

Coefficients are interpreted on the log-odds scale: a unit increase in $x_j$ multiplies the odds by $e^{\beta_j}$.
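
The inverse-logit link and the odds-ratio interpretation can be checked numerically. A minimal sketch with numpy; the coefficient values here are made up for illustration:

```python
import numpy as np

# Hypothetical fitted coefficients: intercept and one predictor.
beta = np.array([-1.0, 0.8])

def inverse_logit(eta):
    """Map the linear predictor back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

x = np.array([1.0, 2.0])           # [intercept term, x1 = 2]
eta = x @ beta                     # linear predictor on the log-odds scale
p = inverse_logit(eta)             # predicted probability

# A unit increase in x1 multiplies the odds by exp(beta_1).
odds_before = p / (1 - p)
p_after = inverse_logit(np.array([1.0, 3.0]) @ beta)
odds_after = p_after / (1 - p_after)
print(np.isclose(odds_after / odds_before, np.exp(beta[1])))  # True
```

The odds ratio is exactly $e^{\beta_1}$ because the odds equal $e^{\eta}$ under the logit link, so adding $\beta_1$ to $\eta$ multiplies them by $e^{\beta_1}$.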

Poisson Regression as a GLM

For count data (e.g., number of insurance claims):

  • Random component: $Y \sim \mathrm{Poisson}(\mu)$
  • Link function: log, $g(\mu) = \log \mu$
  • Model: $\log \mu = \mathbf{x}^\top \boldsymbol{\beta}$

The inverse link ensures predictions are non-negative: $\mu = e^{\mathbf{x}^\top \boldsymbol{\beta}} > 0$. Coefficients are interpreted as multiplicative: a unit increase in $x_j$ multiplies the expected count by $e^{\beta_j}$.
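
Both properties of the log link can be verified directly. A short numpy sketch with hypothetical coefficient values:

```python
import numpy as np

# Hypothetical fitted coefficients for a Poisson GLM with log link.
beta = np.array([0.5, 0.3])

x = np.array([1.0, 4.0])            # [intercept term, x1 = 4]
mu = np.exp(x @ beta)               # inverse link: always positive

# A unit increase in x1 multiplies the expected count by exp(beta_1).
mu_after = np.exp(np.array([1.0, 5.0]) @ beta)
print(mu > 0, np.isclose(mu_after / mu, np.exp(beta[1])))  # True True
```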

Estimation via IRLS

GLMs are fit by maximum likelihood. The log-likelihood for the exponential family is:

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right]$$

There is generally no closed-form solution (except for the normal distribution, which recovers OLS). Instead, we use Iteratively Reweighted Least Squares (IRLS):

  1. Initialize $\boldsymbol{\beta}^{(0)}$ (e.g., from OLS on the link-transformed responses).
  2. At each iteration $t$, compute working responses $z_i$ and weights $w_i$:

     $$z_i = \eta_i + (y_i - \mu_i)\, g'(\mu_i), \qquad w_i = \frac{1}{\mathrm{Var}(Y_i)\,[g'(\mu_i)]^2}$$

  3. Update:

     $$\boldsymbol{\beta}^{(t+1)} = (X^\top W X)^{-1} X^\top W \mathbf{z}$$

  4. Repeat until convergence.

Each iteration is a weighted least squares problem, hence the name. IRLS is a form of Fisher scoring, which is equivalent to Newton-Raphson when the canonical link is used.
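
The algorithm can be sketched from scratch for logistic regression, where the canonical logit link makes the weights simplify to $w_i = \mu_i(1 - \mu_i)$. A minimal numpy illustration, not a production implementation:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by IRLS (Fisher scoring with the canonical logit link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit
        w = mu * (1.0 - mu)                  # weights (canonical link: w_i = Var(Y_i))
        z = eta + (y - mu) / w               # working response
        # Weighted least squares update: beta = (X' W X)^{-1} X' W z
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data: binary outcome driven by one covariate with true beta = (0.5, 1.5).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x1)))
y = rng.binomial(1, p)

beta_hat = irls_logistic(X, y)
print(beta_hat)  # close to [0.5, 1.5], up to sampling noise
```

Each pass through the loop solves one weighted least squares problem, exactly as in the update formula above.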

Overdispersion

The Poisson distribution assumes $\mathrm{Var}(Y) = \mu$ (mean equals variance). In practice, count data often exhibits overdispersion: $\mathrm{Var}(Y) > \mu$. Ignoring overdispersion leads to underestimated standard errors and spuriously significant coefficients.

Remedies include:

  • Quasi-Poisson: Introduces a dispersion parameter $\phi$ so that $\mathrm{Var}(Y) = \phi \mu$, estimated from the data.
  • Negative binomial regression: Models count data with a variance function $\mathrm{Var}(Y) = \mu + \alpha \mu^2$, accommodating extra-Poisson variation.
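
Overdispersion can be diagnosed with the Pearson dispersion statistic, $\hat{\phi} = \sum_i (y_i - \hat{\mu}_i)^2/\hat{\mu}_i \,/\, (n - p)$, which is near 1 for equidispersed data. A simulated sketch (the negative binomial shape parameter is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu_true = 1000, 5.0

# Equidispersed counts (Poisson) vs. overdispersed counts (negative binomial
# with the same mean but variance mu + mu^2 / size).
y_pois = rng.poisson(mu_true, size=n)
size = 2.0  # assumed NB shape parameter
y_nb = rng.negative_binomial(size, size / (size + mu_true), size=n)

def pearson_dispersion(y, mu, n_params=1):
    """Quasi-Poisson dispersion: sum of squared Pearson residuals over (n - p)."""
    return np.sum((y - mu) ** 2 / mu) / (len(y) - n_params)

print(pearson_dispersion(y_pois, y_pois.mean()))  # near 1
print(pearson_dispersion(y_nb, y_nb.mean()))      # well above 1
```

Here the intercept-only fit $\hat{\mu}_i = \bar{y}$ stands in for a fitted GLM; with covariates, $\hat{\mu}_i$ would come from the model.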

Deviance as Goodness of Fit

The deviance generalizes the residual sum of squares to GLMs:

$$D = 2\phi \left[ \ell(\text{saturated}) - \ell(\hat{\boldsymbol{\beta}}) \right]$$

where the saturated model has one parameter per observation (fitting the data perfectly). The deviance measures how far the fitted model is from this perfect fit. For the normal distribution with identity link, the deviance reduces to the RSS.

The null deviance (intercept-only model) minus the residual deviance (fitted model) measures the explained deviance, analogous to $R^2$. For nested models, the difference in deviances follows approximately a $\chi^2$ distribution, enabling likelihood ratio tests.
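
For the Poisson case (with $\phi = 1$) the deviance has the closed form $D = 2\sum_i \left[ y_i \log(y_i/\hat{\mu}_i) - (y_i - \hat{\mu}_i) \right]$, taking $y_i \log(y_i/\hat{\mu}_i) = 0$ when $y_i = 0$. A small numpy sketch using an intercept-only fit as the comparison:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance: twice the log-likelihood gap to the saturated model.
    The y * log(y / mu) term is taken as 0 when y = 0."""
    safe_y = np.where(y > 0, y, 1)
    term = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

y = np.array([0, 2, 3, 5, 1])
mu_null = np.full(y.shape, y.mean())                # intercept-only fit
print(poisson_deviance(y, mu_null))                 # null deviance, > 0
print(poisson_deviance(y, y.astype(float) + 1e-9))  # saturated fit: ~ 0
```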

Why It Matters

GLMs unify an enormous range of statistical models under a single theoretical and computational framework. Instead of learning separate methods for binary outcomes, counts, and continuous data, a practitioner learns one framework and selects the appropriate distribution and link function for the problem at hand. This is both conceptually elegant and practically powerful. GLMs are the backbone of statistical modeling in epidemiology, insurance, ecology, and many other fields where response variables are not Gaussian.

Key Technical Details

  • GLMs estimate parameters (including intercept) via maximum likelihood, not OLS. Standard errors come from the observed or expected Fisher information matrix.
  • The canonical link guarantees that the sufficient statistic for $\boldsymbol{\beta}$ is $X^\top \mathbf{y}$, and the log-likelihood is concave, ensuring a unique maximum.
  • Residual types for GLMs include deviance residuals, Pearson residuals, and working residuals, each useful for different diagnostic purposes.
  • AIC and BIC can be used for model comparison across GLMs with the same response distribution.
  • Regularized GLMs (e.g., penalized logistic regression with $\ell_1$ or $\ell_2$ penalties) are standard in high-dimensional settings.
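
A minimal sketch of an $\ell_2$-penalized GLM, assuming scikit-learn is available (the data here is synthetic, for illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))     # many predictors relative to n
y = (X[:, 0] - X[:, 1] + rng.normal(size=100) > 0).astype(int)

# C is the INVERSE regularization strength: smaller C = stronger L2 penalty.
strong = LogisticRegression(penalty="l2", C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(penalty="l2", C=100.0, max_iter=1000).fit(X, y)

# Stronger penalties shrink the coefficient vector toward zero.
print(np.linalg.norm(strong.coef_) < np.linalg.norm(weak.coef_))  # True
```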

Common Misconceptions

  • "GLMs are nonlinear models." GLMs are linear in the link-transformed mean. The systematic component $\eta = \mathbf{x}^\top \boldsymbol{\beta}$ is linear; only the mapping from $\eta$ to $\mu$ is nonlinear.
  • "You need the canonical link." Non-canonical links are perfectly valid. For example, the probit link (inverse normal CDF) is a common alternative to logit for binary data and may be preferable when the latent variable interpretation is natural.
  • "R-squared works for GLMs." The classical $R^2$ is not well-defined for GLMs. Use deviance explained, pseudo-$R^2$ measures (McFadden's, Nagelkerke's), or information criteria instead.
  • "GLMs handle any response distribution." Only distributions in the exponential family are covered. Heavy-tailed distributions (e.g., the Cauchy) or mixture distributions require other approaches.
  • "Logistic regression is unrelated to linear regression." Logistic regression is a GLM that shares the same linear predictor structure as standard linear regression, differing only in the choice of distribution and link function.

Connections to Other Concepts

  • linear-regression.md: Linear regression is a GLM with a normal distribution and identity link function -- the simplest special case.
  • ridge-and-lasso-regression.md: Regularization extends directly to GLMs, producing penalized logistic regression and penalized Poisson regression for high-dimensional problems.
  • regression-diagnostics.md: GLMs have analogous diagnostic tools (deviance residuals, leverage in the working model, Cook's distance for GLMs) for checking model adequacy.
  • polynomial-regression.md: Polynomial and interaction terms can be included in the linear predictor of any GLM.
  • maximum-likelihood-estimation.md: GLM fitting is a direct application of MLE, with IRLS as the optimization algorithm.
  • zero-shot-classification.md: Logistic regression (a GLM) is the foundational classifier and the bridge between regression and classification in supervised learning.

Further Reading

  • McCullagh and Nelder, "Generalized Linear Models" (1989) -- The definitive reference on GLM theory, estimation, and diagnostics.
  • Nelder and Wedderburn, "Generalized Linear Models" (1972) -- The original paper introducing the GLM framework and IRLS.
  • Agresti, "Foundations of Linear and Generalized Linear Models" (2015) -- A modern and accessible treatment bridging linear models and GLMs.
  • Hastie, Tibshirani, and Friedman, "The Elements of Statistical Learning" (2009) -- Chapter 4 covers logistic regression in the GLM context with a machine learning perspective.