One-Line Summary: Paired t-tests, McNemar's test, and Wilcoxon signed-rank -- determining if performance differences are real or noise.
Prerequisites: Cross-validation, hypothesis testing, classification metrics, regression metrics.
What Is Model Comparison?
Imagine two weather forecasting systems. System A achieves 85.2% accuracy and System B achieves 84.7% accuracy on the same test set. Is A genuinely better, or is the difference just a lucky draw of test examples? Without statistical rigor, we might confidently deploy A when the difference is mere noise -- or dismiss B when it is actually superior on the broader population. Model comparison provides the statistical machinery to answer: "Is this performance difference real?"
Formally, given two models $A$ and $B$ evaluated on the same data, we test the null hypothesis $H_0$ (the models have equal expected performance) against the alternative $H_1$ (their expected performance differs).
How It Works
Why Comparing Means Is Not Enough
Suppose 5-fold CV produces accuracy estimates for two models:
| Fold | Model A | Model B |
|---|---|---|
| 1 | 0.87 | 0.85 |
| 2 | 0.83 | 0.86 |
| 3 | 0.86 | 0.84 |
| 4 | 0.85 | 0.83 |
| 5 | 0.84 | 0.87 |
Mean accuracy: A = 0.850, B = 0.850. Yet looking fold-by-fold, sometimes A wins and sometimes B wins. Even when the means differ, the variability of the fold-level estimates determines whether the difference is significant. We need a test that accounts for this variability.
Paired t-Test on CV Folds
The most straightforward approach: compute the difference $d_i = a_i - b_i$ for each fold $i$, then test whether the mean difference $\bar{d}$ is significantly different from zero.

$$t = \frac{\bar{d}}{s_d / \sqrt{k}}$$

where $s_d$ is the standard deviation of the $d_i$ values and $k$ is the number of folds. Under $H_0$, $t$ follows a $t$-distribution with $k - 1$ degrees of freedom.
Problem: The fold-level estimates are not independent because training sets overlap across folds. This violates the independence assumption of the t-test, inflating the Type I error rate (falsely declaring significance).
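To make the mechanics concrete, here is a minimal sketch of the naive paired t-test applied to the illustrative fold accuracies from the table above, using SciPy's `ttest_rel`. It is shown only for illustration; for CV folds, prefer the corrected test below.

```python
# Naive paired t-test on 5-fold CV accuracies (illustrative values from the table).
from scipy import stats

acc_a = [0.87, 0.83, 0.86, 0.85, 0.84]
acc_b = [0.85, 0.86, 0.84, 0.83, 0.87]

# Tests whether the mean fold-level difference is zero.
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Caveat: CV folds share training data, so this p-value is optimistic
# (the independence assumption is violated).
```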
Corrected Resampled t-Test (Nadeau & Bengio)
Nadeau & Bengio (2003) proposed a correction that adjusts the variance estimate to account for the training set overlap:

$$t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) \hat{\sigma}^2}}$$

where $n_{\text{test}}$ and $n_{\text{train}}$ are the sizes of the test and training sets in each fold, and $\hat{\sigma}^2$ is the variance of the fold-level differences:

$$\hat{\sigma}^2 = \frac{1}{k - 1} \sum_{i=1}^{k} (d_i - \bar{d})^2$$
This correction reduces the false positive rate substantially compared to the naive paired t-test. For 5x2 CV (5 repetitions of 2-fold CV), a specific variant known as the 5x2 CV paired t-test is recommended.
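A minimal sketch of the corrected test, assuming fold-level differences `diffs` and the per-fold train/test sizes of the CV scheme; the helper name and the example numbers are illustrative, not a library API.

```python
# Corrected resampled t-test (Nadeau & Bengio, 2003) -- illustrative sketch.
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    diffs = np.asarray(diffs, dtype=float)
    k = len(diffs)
    mean_diff = diffs.mean()
    var = diffs.var(ddof=1)                          # sample variance of the d_i
    # Inflate the variance by the train/test overlap term n_test / n_train.
    corrected_var = (1.0 / k + n_test / n_train) * var
    t_stat = mean_diff / np.sqrt(corrected_var)
    p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)  # two-sided p-value
    return t_stat, p_value

# Example: 5-fold CV on 1000 examples -> 800 train / 200 test per fold.
diffs = [0.02, -0.03, 0.02, 0.02, -0.03]
print(corrected_resampled_ttest(diffs, n_train=800, n_test=200))
```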
McNemar's Test (Classification)
McNemar's test operates on the predictions themselves rather than on aggregate metrics. It uses a 2x2 contingency table of how two classifiers differ on individual test examples:
|  | B Correct | B Incorrect |
|---|---|---|
| A Correct | $n_{11}$ | $n_{10}$ |
| A Incorrect | $n_{01}$ | $n_{00}$ |
Only the discordant pairs ($n_{10}$: A correct, B incorrect; $n_{01}$: A incorrect, B correct) carry information about which model is better. Under $H_0$:

$$\chi^2 = \frac{(|n_{10} - n_{01}| - 1)^2}{n_{10} + n_{01}}$$

which follows a $\chi^2$ distribution with 1 degree of freedom (the $-1$ is a continuity correction).
Advantages: Does not require cross-validation (works on a single test set), avoids the independence issues of paired t-tests on CV folds, and is more powerful when the number of test examples is large.
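As an illustration, here is a small sketch that builds the discordant counts from per-example predictions and computes the continuity-corrected statistic with SciPy. The helper name and the toy arrays are made up for the example.

```python
# McNemar's test from per-example predictions on a single test set (sketch).
import numpy as np
from scipy import stats

def mcnemar_test(y_true, pred_a, pred_b):
    correct_a = np.asarray(pred_a) == np.asarray(y_true)
    correct_b = np.asarray(pred_b) == np.asarray(y_true)
    # Discordant counts: only examples where the models disagree matter.
    n10 = int(np.sum(correct_a & ~correct_b))   # A correct, B incorrect
    n01 = int(np.sum(~correct_a & correct_b))   # A incorrect, B correct
    # Chi-squared statistic with continuity correction, 1 degree of freedom.
    chi2 = (abs(n10 - n01) - 1) ** 2 / (n10 + n01)
    p_value = stats.chi2.sf(chi2, df=1)
    return chi2, p_value, (n10, n01)

# Toy usage with made-up labels and predictions:
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
pred_a = np.array([0, 1, 1, 0, 1, 1, 1, 1, 0, 0])
pred_b = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])
print(mcnemar_test(y_true, pred_a, pred_b))
```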
Wilcoxon Signed-Rank Test (Non-Parametric)
When fold-level differences are not normally distributed, the Wilcoxon signed-rank test is a robust alternative. It ranks the absolute differences $|d_i|$, assigns signs, and tests whether the sum of positive ranks equals the sum of negative ranks.
Procedure:
- Compute differences $d_i = a_i - b_i$.
- Discard any $d_i = 0$.
- Rank the $|d_i|$ values from smallest to largest.
- Compute $W^+$ (sum of ranks where $d_i > 0$) and $W^-$ (sum of ranks where $d_i < 0$).
- The test statistic is $W = \min(W^+, W^-)$.

Under $H_0$, $W$ has a known distribution. For large $n$, a normal approximation applies.
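SciPy's `wilcoxon` implements this procedure directly; the sketch below applies it to the illustrative fold accuracies from the earlier table.

```python
# Wilcoxon signed-rank test on paired fold-level scores (illustrative values).
from scipy import stats

acc_a = [0.87, 0.83, 0.86, 0.85, 0.84]
acc_b = [0.85, 0.86, 0.84, 0.83, 0.87]

# Ranks |d_i|, drops zero differences, and reports W = min(W+, W-).
# With very few folds (and ties), SciPy may fall back to a normal approximation.
w_stat, p_value = stats.wilcoxon(acc_a, acc_b)
print(f"W = {w_stat}, p = {p_value:.3f}")
```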
Friedman Test (Comparing Multiple Models)
When comparing $k$ models across $N$ datasets (or folds), the Friedman test is a non-parametric alternative to repeated-measures ANOVA. For each fold, models are ranked from best to worst. The test statistic evaluates whether the average ranks differ significantly:

$$\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right]$$

where $R_j$ is the average rank of model $j$.
If the Friedman test is significant, post-hoc pairwise tests determine which specific pairs differ.
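As a sketch, SciPy's `friedmanchisquare` computes this statistic from per-fold scores; the three score lists below are illustrative.

```python
# Friedman test for three models evaluated on the same folds (illustrative scores).
from scipy import stats

# Each list holds one model's score on the same N folds (or datasets).
model_a = [0.87, 0.83, 0.86, 0.85, 0.84]
model_b = [0.85, 0.86, 0.84, 0.83, 0.87]
model_c = [0.82, 0.81, 0.83, 0.80, 0.82]

stat, p_value = stats.friedmanchisquare(model_a, model_b, model_c)
print(f"chi2_F = {stat:.3f}, p = {p_value:.3f}")
```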
Nemenyi Post-Hoc Test
After a significant Friedman test, the Nemenyi test checks all pairwise comparisons. Two models are significantly different if their average rank difference exceeds the critical difference:

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$$

where $q_\alpha$ is the critical value from the Studentized range distribution. Results are often visualized in a critical difference diagram, where models connected by a horizontal bar are not significantly different.
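A small sketch of the critical difference computation, assuming $\alpha = 0.05$. The hard-coded $q_\alpha$ values are the commonly cited two-tailed Nemenyi critical values (see Demsar, 2006); check them against the published table for your $k$.

```python
# Nemenyi critical difference for k models compared on N datasets (sketch).
import math

Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}  # two-tailed, alpha = 0.05

def critical_difference(k, n_datasets, q_table=Q_ALPHA_05):
    q = q_table[k]
    return q * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Example: 3 models compared on 10 datasets.
cd = critical_difference(k=3, n_datasets=10)
print(f"CD = {cd:.3f}")  # average-rank gaps larger than this are significant
```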
Effect Sizes
Statistical significance alone can be misleading -- with enough data, even tiny differences become significant. Effect size measures the practical magnitude of the difference. For paired fold-level differences, Cohen's $d$ is:

$$d = \frac{\bar{d}}{s_d}$$

Rules of thumb: $d \approx 0.2$ is small, $d \approx 0.5$ is medium, $d \approx 0.8$ is large. Always report effect sizes alongside p-values.
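A minimal sketch of the paired effect-size computation on illustrative fold differences:

```python
# Cohen's d for paired differences: mean difference / std of differences.
import numpy as np

diffs = np.array([0.02, 0.01, 0.03, -0.01, 0.02])  # illustrative a_i - b_i values
cohens_d = diffs.mean() / diffs.std(ddof=1)
print(f"d = {cohens_d:.2f}")
```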
Why It Matters
In machine learning research and practice, model comparisons are made constantly -- on Kaggle leaderboards, in papers, and in production A/B tests. Without statistical tests, we cannot distinguish genuine improvements from noise. This leads to wasted engineering effort on "improvements" that do not generalize, or publication of results that do not replicate.
Key Technical Details
- Multiple comparisons: When comparing many model pairs, apply corrections (Bonferroni, Holm) to control the family-wise error rate (see the sketch after this list).
- Power: McNemar's test has higher statistical power than paired t-tests on CV folds for large test sets because it uses individual predictions rather than fold aggregates.
- Assumptions matter: The paired t-test assumes normally distributed differences. When in doubt, use the Wilcoxon signed-rank test.
- Practical significance vs. statistical significance: A 0.1% accuracy improvement may be statistically significant on a large test set but practically meaningless. Always report effect sizes.
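As a sketch of the multiple-comparisons point above, `statsmodels` provides Holm and Bonferroni adjustments via `multipletests`; the raw p-values below are illustrative placeholders.

```python
# Family-wise error control over several pairwise comparisons (Holm correction).
from statsmodels.stats.multitest import multipletests

raw_p = [0.012, 0.034, 0.21, 0.047]  # p-values from four pairwise tests

reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for p, p_adj, rej in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.3f} -> Holm-adjusted p = {p_adj:.3f}, reject H0: {rej}")
```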
Common Misconceptions
- "The model with higher cross-validation accuracy is always better." Without a statistical test, you cannot know whether the difference is due to chance. Fold-level variance can be substantial.
- "A paired t-test on CV folds is valid." The standard paired t-test underestimates variance because CV folds share training data. Use the corrected resampled t-test or McNemar's test.
- "p < 0.05 means the improvement matters." Statistical significance says the difference is unlikely due to chance, not that it is large enough to care about. Report effect sizes and confidence intervals.
- "You need cross-validation for model comparison." McNemar's test works on a single train/test split and avoids the dependence issues of CV-based tests entirely.
Connections to Other Concepts
- cross-validation.md: Produces the fold-level estimates that feed into paired tests.
- classification-metrics.md: The metrics being compared (accuracy, RMSE, etc.).
- hyperparameter-tuning.md: After tuning, statistical tests verify that the tuned model genuinely outperforms alternatives.
- learning-curves.md: Provide visual context for where models differ and whether the gap is narrowing or widening.
- calibration.md: Two models may have similar accuracy but very different calibration quality; compare calibration metrics too.
Further Reading
- Demsar, "Statistical Comparisons of Classifiers over Multiple Data Sets" (2006) -- Definitive guide to Friedman test, Nemenyi test, and critical difference diagrams.
- Nadeau & Bengio, "Inference for the Generalization Error" (2003) -- Corrected resampled t-test for cross-validation.
- Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms" (1998) -- Comprehensive comparison of McNemar's test, paired t-test, and 5x2 CV test.