One-Line Summary: Residual analysis, heteroscedasticity, multicollinearity, and influence points -- verifying assumptions before trusting results.
Prerequisites: Linear regression (OLS, assumptions, R-squared), probability distributions (normal, chi-squared), hypothesis testing (p-values, test statistics).
What Is Regression Diagnostics?
You have fit a linear regression model, obtained coefficients, and computed $R^2$. Can you trust the results? Regression diagnostics is the practice of systematically checking whether the model's assumptions hold and whether individual data points are unduly driving the results. Think of it as a medical checkup for your model: the regression may look healthy on the surface, but diagnostics can reveal hidden pathologies -- non-constant variance, correlated predictors, or single observations that change the entire fit.
Without diagnostics, you risk reporting confidence intervals that are too narrow, p-values that are misleading, and predictions that are unreliable. Diagnostics transform regression from a black-box procedure into a principled statistical analysis.
How It Works
Residual Plots
Residuals are the primary diagnostic tool. Under correct model specification, residuals should appear as random noise with no discernible pattern.
Residuals vs. Fitted Values: Plot the residuals $\hat{e}_i = y_i - \hat{y}_i$ against the fitted values $\hat{y}_i$. Look for:
- A horizontal band of constant width centered at zero (good).
- A funnel shape (heteroscedasticity).
- A curved pattern (nonlinearity -- consider polynomial terms or a transformation).
Residuals vs. Predictors: Plot $\hat{e}_i$ against each predictor $x_j$. Patterns suggest that the functional form for $x_j$ is misspecified (e.g., a quadratic term is needed).
Q-Q Plot (Normal Quantile-Quantile): Plot the ordered standardized residuals against theoretical quantiles of $N(0, 1)$. Points should fall along the diagonal. Deviations in the tails indicate heavy-tailed or skewed error distributions. Standardized residuals are computed as:

$$r_i = \frac{\hat{e}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$$

where $\hat{\sigma}$ is the residual standard error and $h_{ii}$ is the $i$-th diagonal element of the hat matrix $H = X(X^\top X)^{-1}X^\top$.
Scale-Location Plot: Plot $\sqrt{|r_i|}$ against $\hat{y}_i$. A horizontal trend confirms homoscedasticity; an increasing trend indicates variance that grows with the predicted value.
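A minimal sketch of these four plots in Python (statsmodels and matplotlib), fit on synthetic data with deliberately heteroscedastic errors so the funnel shape is visible; the dataset and variable names are illustrative assumptions, not from any particular source:

```python
# Sketch: the four standard residual plots for an OLS fit.
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1 + 0.2 * x, n)  # variance grows with x

model = sm.OLS(y, sm.add_constant(x)).fit()
fitted = model.fittedvalues
resid = model.resid
std_resid = model.get_influence().resid_studentized_internal

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Residuals vs. fitted: look for a horizontal band of constant width.
axes[0, 0].scatter(fitted, resid, alpha=0.5)
axes[0, 0].axhline(0, color="gray")
axes[0, 0].set(title="Residuals vs. Fitted", xlabel="Fitted value", ylabel="Residual")

# Residuals vs. predictor: a curved pattern suggests a misspecified form.
axes[0, 1].scatter(x, resid, alpha=0.5)
axes[0, 1].axhline(0, color="gray")
axes[0, 1].set(title="Residuals vs. Predictor", xlabel="x", ylabel="Residual")

# Normal Q-Q: points should fall along the diagonal.
stats.probplot(std_resid, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title("Normal Q-Q")

# Scale-location: an increasing trend indicates heteroscedasticity.
axes[1, 1].scatter(fitted, np.sqrt(np.abs(std_resid)), alpha=0.5)
axes[1, 1].set(title="Scale-Location", xlabel="Fitted value",
               ylabel="sqrt(|standardized residual|)")

plt.tight_layout()
plt.show()
```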
Heteroscedasticity
When $\mathrm{Var}(\varepsilon_i)$ is not constant, OLS estimates remain unbiased but are no longer efficient, and standard errors are incorrect.
Breusch-Pagan Test: Regress the squared residuals on the predictors. Under homoscedasticity, the $R^2$ of this auxiliary regression should be near zero. The test statistic $nR^2 \sim \chi^2_k$ under $H_0$, where $k$ is the number of predictors.
White Test: A more general version that includes squares and cross-products of predictors in the auxiliary regression, detecting nonlinear forms of heteroscedasticity.
Remedies: Use heteroscedasticity-consistent (HC) standard errors (White's robust standard errors), apply a variance-stabilizing transformation (e.g., $\log y$), or use weighted least squares (WLS) with weights $w_i \propto 1/\sigma_i^2$.
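The following sketch runs the Breusch-Pagan test and applies two of these remedies in statsmodels, reusing the same kind of synthetic heteroscedastic data as above; the WLS weights assume the error standard deviation is known to grow linearly in $x$, which is a modeling assumption, not something the data hand you:

```python
# Sketch: detecting heteroscedasticity and applying remedies.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1 + 0.2 * x, n)  # variance grows with x
X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Breusch-Pagan: LM statistic is n * R^2 of the auxiliary regression
# of squared residuals on the predictors.
lm_stat, lm_pvalue, _, _ = het_breuschpagan(ols.resid, X)
print(f"Breusch-Pagan LM = {lm_stat:.1f}, p = {lm_pvalue:.4g}")

# Remedy 1: White's robust (HC) standard errors; coefficients are unchanged,
# only the standard errors are recomputed.
robust = ols.get_robustcov_results(cov_type="HC1")
print("robust SEs:", robust.bse)

# Remedy 2: WLS with weights w_i proportional to 1/sigma_i^2, here using the
# (assumed known) error standard deviation 1 + 0.2*x.
wls = sm.WLS(y, X, weights=1.0 / (1 + 0.2 * x) ** 2).fit()
print("WLS coefficients:", wls.params)
```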
Multicollinearity
When predictors are highly correlated, individual coefficient estimates become unstable -- small changes in the data produce large changes in $\hat{\beta}$.
Variance Inflation Factor (VIF): For predictor $x_j$, regress $x_j$ on all other predictors and compute the resulting $R_j^2$. Then:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
A VIF of 1 means no collinearity. VIF $> 5$ is a warning; VIF $> 10$ is a serious problem. The variance of $\hat{\beta}_j$ is inflated by the factor $\mathrm{VIF}_j$ relative to the case of uncorrelated predictors.
Remedies: Remove or combine collinear predictors, apply principal component regression, or use Ridge regression (which was specifically designed to handle multicollinearity).
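A short sketch of the VIF computation with statsmodels, on a hypothetical design matrix in which two columns are deliberately made nearly collinear:

```python
# Sketch: VIF_j = 1 / (1 - R_j^2) for each predictor.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                  # independent of the others
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# Each VIF comes from regressing column j on all remaining columns.
for j, name in enumerate(X.columns):
    if name != "const":
        print(f"VIF({name}) = {variance_inflation_factor(X.values, j):.1f}")
# Expect VIF(x1) and VIF(x2) to be large, VIF(x3) close to 1.
```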
Influential Points and Outliers
Not all observations contribute equally to the fit. Some may disproportionately determine the regression surface.
Leverage: The $i$-th diagonal element of the hat matrix, $h_{ii}$, measures how far $x_i$ is from the center of the predictor space. High-leverage points have unusual predictor values. The average leverage is $p/n$, where $p$ is the number of parameters; points with $h_{ii} > 2p/n$ deserve scrutiny.
Cook's Distance: Combines leverage and residual size to measure each observation's influence on the entire coefficient vector:

$$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

A common rule of thumb: $D_i > 1$ or $D_i > 4/n$ warrants investigation. Cook's distance answers: "How much would the fit change if observation $i$ were removed?"
DFFITS and DFBETAS: DFFITS measures the change in the fitted value $\hat{y}_i$ when observation $i$ is deleted. DFBETAS measures the change in each individual coefficient $\hat{\beta}_j$.
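The sketch below computes all four influence measures via statsmodels' OLSInfluence, with one synthetic observation planted to have both an unusual predictor value and a large residual; the thresholds are the rules of thumb quoted above:

```python
# Sketch: leverage, Cook's distance, DFFITS, and DFBETAS per observation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
x[-1], y[-1] = 6.0, -5.0  # plant a point with high leverage AND a large residual

X = sm.add_constant(x)
infl = sm.OLS(y, X).fit().get_influence()

p = X.shape[1]                     # number of parameters
leverage = infl.hat_matrix_diag    # h_ii
cooks_d = infl.cooks_distance[0]   # D_i
print("h_ii > 2p/n:", np.where(leverage > 2 * p / n)[0])
print("D_i > 4/n:  ", np.where(cooks_d > 4 / n)[0])
print("DFFITS[-1]: ", infl.dffits[0][-1])    # change in fitted value
print("DFBETAS[-1]:", infl.dfbetas[-1])      # change in each coefficient
```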
Autocorrelation
When observations have a natural ordering (e.g., time series), errors may be correlated.
Durbin-Watson Test: Tests for first-order autocorrelation in the residuals. The test statistic is:

$$d = \frac{\sum_{t=2}^{n} (\hat{e}_t - \hat{e}_{t-1})^2}{\sum_{t=1}^{n} \hat{e}_t^2}$$

$d \approx 2$ suggests no autocorrelation; $d$ near 0 indicates positive autocorrelation (common in time series); $d$ near 4 indicates negative autocorrelation.
Remedies: Include lagged variables, use time-series models (ARIMA), or apply Newey-West standard errors that are robust to autocorrelation.
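A brief sketch of the Durbin-Watson statistic and Newey-West (HAC) standard errors in statsmodels, on a synthetic series with AR(1) errors; the autocorrelation coefficient 0.7 and the maxlags choice are arbitrary illustrative assumptions:

```python
# Sketch: Durbin-Watson statistic and Newey-West (HAC) standard errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 300
t = np.arange(n, dtype=float)
e = np.zeros(n)
for i in range(1, n):              # AR(1) errors with rho = 0.7
    e[i] = 0.7 * e[i - 1] + rng.normal()
y = 1 + 0.05 * t + e

model = sm.OLS(y, sm.add_constant(t)).fit()
print(f"Durbin-Watson d = {durbin_watson(model.resid):.2f}")  # well below 2 here

# Newey-West SEs are robust to autocorrelation (and heteroscedasticity);
# maxlags is a tuning choice.
hac = model.get_robustcov_results(cov_type="HAC", maxlags=5)
print("HAC SEs:", hac.bse)
```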
What to Do When Assumptions Are Violated
| Violation | Consequence | Remedy |
|---|---|---|
| Nonlinearity | Biased predictions | Add polynomial/interaction terms, transform variables |
| Heteroscedasticity | Invalid standard errors | Robust SE, WLS, log transform |
| Multicollinearity | Unstable coefficients | Ridge regression, drop/combine variables, PCA |
| Non-normal errors | Invalid confidence intervals | Bootstrap, transform $y$, use robust regression |
| Autocorrelation | Underestimated SE | Time-series methods, Newey-West SE |
| Influential points | Distorted fit | Investigate, robust regression (Huber, M-estimators) |
Why It Matters
A regression model is only as trustworthy as its assumptions. Publishing coefficients and p-values without checking assumptions is like reporting a patient's temperature without calibrating the thermometer. In practice, assumption violations are the norm, not the exception. Diagnostics tell you which violations are present, how severe they are, and what corrective actions to take. They are the difference between statistical analysis and numerical fortune-telling.
Key Technical Details
- Standardized residuals, studentized residuals, and externally studentized residuals differ in how they estimate the error variance. Externally studentized residuals (estimating $\sigma$ with observation $i$ left out) follow a $t_{n-p-1}$ distribution under normality, making them useful for formal outlier tests (see the sketch after this list).
- The hat matrix $H = X(X^\top X)^{-1}X^\top$ is idempotent ($H^2 = H$) and symmetric, with eigenvalues 0 and 1. The trace $\operatorname{tr}(H) = \sum_i h_{ii}$ equals the number of parameters $p$.
- Cook's distance can be decomposed as the product of a leverage component and a residual component, showing that influence requires both unusual predictors and a large residual.
- VIF is directly related to the eigenvalues of the correlation matrix of predictors; small eigenvalues correspond to high VIF directions.
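As an illustration of the first point above, a sketch of a formal outlier test using externally studentized residuals with a Bonferroni-corrected $t$ cutoff; the planted outlier and the $\alpha = 0.05$ level are illustrative assumptions:

```python
# Sketch: formal outlier test with externally studentized residuals,
# which follow t with n - p - 1 degrees of freedom under normality.
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
y[0] += 8.0  # plant one outlier

X = sm.add_constant(x)
infl = sm.OLS(y, X).fit().get_influence()
t_resid = infl.resid_studentized_external  # sigma estimated without obs. i

p = X.shape[1]
alpha = 0.05
# Bonferroni correction: n comparisons, two-sided test.
cutoff = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)
print("Flagged outliers:", np.where(np.abs(t_resid) > cutoff)[0])
```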
Common Misconceptions
- "If R-squared is high, the model is fine." A high says nothing about whether assumptions hold. A model with severe heteroscedasticity or influential outliers can have and still produce misleading inference.
- "Outliers should always be removed." Outliers may be the most informative observations in the dataset. Remove them only if they result from data entry errors or measurement failures; otherwise, use robust methods.
- "The Shapiro-Wilk test on residuals is essential." Formal normality tests are overpowered on large samples (rejecting trivial deviations) and underpowered on small samples. Q-Q plots are more informative in practice.
- "Multicollinearity makes the model wrong." It does not bias predictions; it inflates the variance of individual coefficients. If prediction (not interpretation) is the goal, multicollinearity is less of a concern.
- "Diagnostics are only needed once." After any model change (adding variables, transformations, removing observations), diagnostics should be rerun.
Connections to Other Concepts
- linear-regression.md: Diagnostics verify the assumptions that justify OLS estimation and inference.
- polynomial-regression.md: Residual plots with curved patterns suggest adding polynomial terms to the model.
- ridge-and-lasso-regression.md: High VIF values are a direct indicator that regularization may improve the model.
- generalized-linear-models.md: GLMs have their own diagnostic tools (deviance residuals, Pearson residuals) analogous to those for linear regression.
- bias-variance-tradeoff.md: Influential points can dramatically affect both bias and variance of the estimator.
- cross-validation.md: Complements diagnostics by providing an assumption-free estimate of predictive performance.
Further Reading
- Belsley, Kuh, and Welsch, "Regression Diagnostics" (1980) -- The definitive reference on influence measures, leverage, and collinearity diagnostics.
- Fox, "Applied Regression Analysis and Generalized Linear Models" (2015) -- Excellent treatment of diagnostic methods with practical examples.
- Cook and Weisberg, "Residuals and Influence in Regression" (1982) -- Foundational work on Cook's distance and related influence diagnostics.