One-Line Summary: Building an additive model by fitting each new tree to the residual errors of the ensemble -- the most powerful tabular method.

Prerequisites: Gradient descent, loss functions, decision trees, AdaBoost, Taylor expansion basics.

What Is Gradient Boosting?

Imagine a sculptor starting with a rough block of marble. The first pass removes large chunks to establish a coarse shape. Each subsequent pass focuses on finer details -- smoothing surfaces, refining edges, adding texture. The sculptor never starts over; each pass corrects the remaining imperfections from the previous one.

Gradient Boosting, formalized by Jerome Friedman in 2001, builds predictive models in exactly this way. It constructs an ensemble by sequentially adding models that correct the errors of the current ensemble. Each new model is fit not to the original targets but to the negative gradient of the loss function evaluated at the current predictions. This reframes ensemble construction as gradient descent in function space -- a profound insight that generalizes AdaBoost to arbitrary differentiable loss functions.

How It Works

The Gradient Boosting Framework

We seek a function $F(x)$ that minimizes a loss $\sum_{i=1}^{n} L(y_i, F(x_i))$ over the training data. Gradient Boosting builds $F$ as an additive expansion:

$$F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)$$

where $F_0(x)$ is an initial estimate (e.g., the mean of $y$ for squared error loss), the $h_m$ are base learners (typically small decision trees), and $\eta$ is the learning rate (shrinkage parameter).

Algorithm: Gradient Boosting Machine

Initialize: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$

For $m = 1, \dots, M$:

  1. Compute pseudo-residuals (negative gradients):

     $r_{im} = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad i = 1, \dots, n$

  2. Fit a base learner $h_m(x)$ to the pseudo-residuals $\{(x_i, r_{im})\}_{i=1}^{n}$.

  3. Compute the optimal step size (for tree-based learners, optimize $\gamma_{jm}$ within each leaf region $R_{jm}$):

     $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, F_{m-1}(x_i) + \gamma\bigr)$

  4. Update the model:

     $F_m(x) = F_{m-1}(x) + \eta \sum_{j} \gamma_{jm}\, \mathbf{1}\bigl(x \in R_{jm}\bigr)$
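
To make the loop concrete, here is a minimal from-scratch sketch for squared error loss, where the pseudo-residuals are ordinary residuals. It assumes scikit-learn's DecisionTreeRegressor as the base learner; the per-leaf step-size optimization is folded into the tree's fitted values, which is what the squared-error case reduces to. Function names and parameter values are illustrative, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_squared_error(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting for squared error loss.

    F_0 is the mean of y; each tree is fit to the current residuals
    (the negative gradient of 0.5 * (y - F)^2), then added with shrinkage.
    """
    f0 = float(y.mean())                     # initial constant model F_0
    pred = np.full(y.shape[0], f0)           # current ensemble predictions
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                 # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # project the gradient onto a tree
        pred += learning_rate * tree.predict(X)   # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)
f0, trees = fit_gbm_squared_error(X, y)
```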

Gradient Descent in Function Space

The key insight -- Friedman's central contribution -- is interpreting this procedure as gradient descent in function space. In ordinary gradient descent, we update parameters: $\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t)$. In gradient boosting, we update the function itself:

$$F_m(x) = F_{m-1}(x) - \eta \left[\frac{\partial L(y, F(x))}{\partial F(x)}\right]_{F = F_{m-1}}$$

The pseudo-residuals $r_{im}$ are the components of the functional gradient evaluated at the training points. Since we cannot represent arbitrary functions, we project the gradient onto the space of base learners by fitting $h_m$ to the pseudo-residuals.
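
As a worked example of the functional gradient (a standard derivation, spelled out here for clarity), take squared error loss:

$$L(y, F) = \tfrac{1}{2}\,(y - F)^2 \quad\Longrightarrow\quad r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - F_{m-1}(x_i)$$

so one functional gradient step is exactly: fit a tree to the current residuals and add it to the ensemble.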

Loss Functions and Their Pseudo-Residuals

Different loss functions yield different pseudo-residuals:

| Loss Function | Pseudo-Residual |
| --- | --- |
| Squared Error (MSE) | $y_i - F(x_i)$ (actual residuals) |
| Absolute Error (MAE) | $\operatorname{sign}\bigl(y_i - F(x_i)\bigr)$ |
| Huber Loss | Hybrid: residual $y_i - F(x_i)$ if $\lvert y_i - F(x_i)\rvert \le \delta$, else $\delta \cdot \operatorname{sign}\bigl(y_i - F(x_i)\bigr)$ |
| Log Loss (classification) | $y_i - \sigma\bigl(F(x_i)\bigr)$, where $\sigma$ is the sigmoid |

For squared error, the pseudo-residuals are literally the residuals, which gives the intuitive "fit trees to residuals" description. For other losses, the pseudo-residuals point in the direction of steepest descent for each observation.
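
A small NumPy sketch of how the table's entries come out of differentiating each loss. It assumes labels in {0, 1} for log loss and that $F$ is a raw score (log-odds); the function name and `delta` default are illustrative.

```python
import numpy as np

def pseudo_residuals(y, F, loss="squared", delta=1.0):
    """Negative gradient of the loss with respect to the current predictions F."""
    if loss == "squared":          # L = 0.5 * (y - F)^2
        return y - F               # the ordinary residual
    if loss == "absolute":         # L = |y - F|
        return np.sign(y - F)
    if loss == "huber":            # quadratic near zero, linear in the tails
        r = y - F
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
    if loss == "logloss":          # y in {0, 1}, F is the log-odds
        p = 1.0 / (1.0 + np.exp(-F))   # sigmoid
        return y - p
    raise ValueError(f"unknown loss: {loss}")
```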

Key Hyperparameters

Learning rate (shrinkage) $\eta$: Controls the contribution of each tree. Smaller values (0.01--0.1) require more trees but produce better generalization through regularization. Friedman showed empirically that small values of $\eta$ almost always improve test performance.

Tree depth (interaction depth): Controls the complexity of each base learner and the order of feature interactions. A tree with $J$ terminal nodes (leaves) can capture at most $(J-1)$-way interactions:

  • $J = 2$: Stumps, additive model only (no interactions)
  • $J = 4$--$8$: Typical range, captures moderate interactions
  • Higher $J$: More complex interactions but higher variance per tree

Number of trees $M$: Must be tuned via cross-validation. Unlike Random Forests, gradient boosting can overfit with too many trees.
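
As an illustration of how these knobs interact, here is a sketch using scikit-learn's GradientBoostingRegressor, with staged_predict tracing validation error as trees are added so that $M$ can be chosen empirically. The dataset and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Small learning rate + many trees + shallow base learners.
gbm = GradientBoostingRegressor(
    n_estimators=1000,   # M: upper bound on the number of trees
    learning_rate=0.05,  # eta: shrinkage
    max_depth=3,         # limits the interaction order of each tree
    random_state=0,
).fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., M trees,
# letting us pick the best M on the validation set.
val_errors = [
    mean_squared_error(y_val, pred) for pred in gbm.staged_predict(X_val)
]
best_m = int(np.argmin(val_errors)) + 1
print(f"best number of trees: {best_m}, val MSE: {val_errors[best_m - 1]:.3f}")
```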

Stochastic Gradient Boosting

Friedman (2002) proposed training each tree on a random subsample of the training data (without replacement), typically 50--80%. This introduces randomness analogous to stochastic gradient descent:

  • Reduces computation per iteration
  • Acts as regularization, reducing overfitting
  • Often improves generalization over deterministic gradient boosting
  • Enables out-of-bag error estimation similar to bagging
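
In scikit-learn's implementation, this corresponds to the subsample parameter; values below 1.0 switch on stochastic gradient boosting and also enable out-of-bag improvement estimates. A minimal sketch, with illustrative parameter values:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

# subsample < 1.0 trains each tree on a random fraction of rows
# (without replacement), i.e. Friedman's stochastic gradient boosting.
sgbm = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.5,      # 50% of the training rows per tree
    random_state=0,
).fit(X, y)

# With subsample < 1.0, oob_improvement_ tracks the estimated improvement
# in loss on the held-out rows for each added tree.
print(sgbm.oob_improvement_[:5])
```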

Why It Matters

Gradient Boosting is arguably the most successful algorithm for structured/tabular data. It dominates machine learning competitions on platforms like Kaggle, powers critical systems in industry (fraud detection, recommendation, ranking, risk scoring), and provides a flexible framework that accommodates virtually any differentiable loss function. Its formulation as functional gradient descent provides deep theoretical grounding and connects ensemble learning to optimization theory. The modern implementations -- XGBoost, LightGBM, and CatBoost -- have made gradient boosting the default choice for tabular machine learning.

Key Technical Details

  • Bias reduction: Unlike bagging, gradient boosting primarily reduces bias. Each new tree corrects systematic errors in the current ensemble.
  • Regularization is critical: Without shrinkage ($\eta < 1$), subsampling, or depth limits, gradient boosting easily overfits. The interplay between $\eta$ and $M$ is key: smaller $\eta$ requires larger $M$.
  • Feature importance: Like Random Forests, gradient boosting provides impurity-based feature importance summed across all trees in the ensemble.
  • Monotone constraints: Many implementations allow enforcing monotonic relationships between features and predictions, useful for regulated industries.
  • Early stopping: Monitor validation loss and stop adding trees when it begins to increase. This is more principled than fixing $M$ in advance (a sketch follows this list).
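
A sketch of early stopping using scikit-learn's built-in validation split: setting n_iter_no_change holds out validation_fraction of the training data and stops once the score stops improving. The specific values below are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)

gbm = GradientBoostingRegressor(
    n_estimators=2000,        # generous upper bound on M
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,  # internal hold-out set for monitoring
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    tol=1e-4,
    random_state=0,
).fit(X, y)

# n_estimators_ is the number of trees actually fitted before stopping.
print(f"stopped after {gbm.n_estimators_} trees")
```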

Common Misconceptions

  • "Gradient boosting fits trees to residuals." This is only true for squared error loss. For general losses, the trees are fit to pseudo-residuals (negative gradients), which are not the same as actual residuals. The "residual" intuition is a useful simplification but technically imprecise.

  • "A lower learning rate is always better." While smaller generally improves generalization, it requires proportionally more trees, increasing training time. There is a practical optimum where the computational budget is balanced against the regularization benefit.

  • "Gradient boosting cannot handle categorical features." While the base algorithm uses decision trees that naturally handle ordered splits, standard implementations require encoding categoricals. However, CatBoost was specifically designed to handle categorical features natively.

  • "More trees always improve the model." Unlike Random Forests, gradient boosting can and does overfit with excessive trees. The training loss continues to decrease, but test loss eventually rises. Early stopping or cross-validation is essential.

Connections to Other Concepts

  • adaboost.md: Gradient boosting with exponential loss recovers AdaBoost. Gradient boosting generalizes AdaBoost to any differentiable loss, making it more robust (e.g., log loss is less sensitive to outliers than exponential loss).
  • bagging-and-bootstrap.md: Stochastic gradient boosting borrows the subsampling idea from bagging. However, bagging reduces variance while boosting reduces bias -- they attack different problems.
  • random-forests.md: The main alternative for tabular data. Random Forests are easier to tune and parallelize; gradient boosting generally achieves lower error with careful tuning.
  • xgboost-lightgbm-catboost.md: Industrial implementations that add regularization, approximate splitting, and engineering optimizations to the gradient boosting framework.
  • stacking-and-blending.md: Gradient boosting models are frequently used as base learners in stacking ensembles, often paired with Random Forests and linear models for diversity.

Further Reading

  • Friedman, "Greedy Function Approximation: A Gradient Boosting Machine" (2001) -- The foundational paper deriving gradient boosting as functional gradient descent.
  • Friedman, "Stochastic Gradient Boosting" (2002) -- Introduces subsampling for regularization and variance reduction.
  • Friedman, Hastie, and Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting" (2000) -- Bridges AdaBoost and gradient boosting through the lens of statistical modeling.
  • Natekin and Knoll, "Gradient Boosting Machines: A Tutorial" (2013) -- Accessible tutorial covering theory and practical considerations.