Core Concepts · Module 15 · 9 min read

Random Forest

One decision tree is a brittle thing. A hundred slightly-different trees, voting, is a remarkably sturdy classifier. The trick that makes the difference is randomness.

The five-bullet version

  • A decision tree splits the data on one feature at a time, learning a tree of yes/no questions to reach a label.
  • One tree overfits — it memorizes its training set, including the noise.
  • Random Forest grows hundreds of trees, each on a bootstrap sample of the data, each only allowed to consider a random subset of features at every split.
  • At inference, every tree votes; the majority wins (classification) or the predictions are averaged (regression).
  • Still a strong baseline in 2026 — interpretable, tabular-friendly, no GPU needed.

§ 00 · DECISION TREES, IN ONE BREATH · Splitting on one feature at a time

A decision tree is a series of yes/no questions about your features. “Is age > 30? Is income > $50k?” Each question splits the data; the algorithm picks the question that best separates the classes at every step, and the leaf you land in outputs the predicted class or value.

The tree is built top-down, greedily. At each node, try every possible split and pick the one that maximally reduces impurity (Gini or entropy). Keep going until you hit a stopping rule — max depth, minimum leaf size, or pure leaves.
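To make the greedy step concrete, here is a minimal sketch (plain NumPy, not from the module) of how one split is chosen: score every feature/threshold pair by the weighted Gini impurity of the two children and keep the best.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum over classes of p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy step: the (feature, threshold) pair with the lowest weighted child impurity."""
    best_j, best_t, best_score = None, None, np.inf
    for j in range(X.shape[1]):                      # try every feature...
        for t in np.unique(X[:, j]):                 # ...and every observed threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue                             # skip splits that leave a child empty
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t
```

A real tree builder applies this step recursively to each child node until one of the stopping rules fires.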

§ 01 · WHY ONE TREE OVERFITS · Brittleness as a feature, not a bug

A single deep tree is famously brittle. Move one training point and the splits change, and a tree grown deep enough memorizes the training set, noise included.

Fig 1 · The classic decision-tree brittleness: same training distribution, different bootstraps and feature subsets, three different decision boundaries (Tree A’s root splits on x₃ < 2.1, Tree B’s on x₇ < 0.8, Tree C’s on x₂ < 1.5). The single-tree fix would be aggressive pruning; the ensemble fix is to embrace the variance and average it out.

§ 02 · BAGGING + FEATURE RANDOMNESS · Two random tricks, stacked

Random Forest, from Leo Breiman in 2001, takes the brittle-tree problem and turns it into an advantage. Build many trees. Train each one on a slightly different version of the dataset. Make sure no two trees see the same features at the same split. Then vote.

Two specific tricks:

  • Bagging: every tree is trained on a bootstrap sample of the training set, drawn with replacement and the same size as the original, so each tree sees a slightly different dataset.
  • Feature subsampling: at every split, a tree is only allowed to consider a random subset of the features (commonly √p of the p features for classification), so no two trees lean on the same dominant feature everywhere.

Together: every tree is trained on different data, with different features available at every split. They make different mistakes. Their average is more stable than any individual.
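As a sketch of the two tricks stacked (assuming scikit-learn is available; the function name and defaults here are illustrative, not from the module): each tree gets its own bootstrap sample, and `max_features="sqrt"` restricts every split to a random feature subset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Trick 1: bootstrap sample per tree. Trick 2: random feature subset per split."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))     # sample rows with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",                       # consider only sqrt(p) features at each split
            random_state=int(rng.integers(2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees
```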

§ 03 · THE FOREST VOTES · Aggregation as denoising

At inference time, send the input to every tree. Each tree gives a prediction. For classification, take the majority. For regression, take the mean.
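Continuing the sketch from § 02 (same assumptions, not part of the module), aggregation is a per-input majority vote, and the vote split falls out as a rough confidence score; for regression the analogous step is simply `votes.mean(axis=0)`.

```python
import numpy as np

def forest_predict(trees, X):
    """Majority vote across trees; also return the fraction of agreeing trees."""
    votes = np.stack([t.predict(X) for t in trees])    # shape: (n_trees, n_samples)
    preds, agreement = [], []
    for column in votes.T:                             # one column of votes per input
        classes, counts = np.unique(column, return_counts=True)
        preds.append(classes[np.argmax(counts)])       # majority class wins
        agreement.append(counts.max() / len(column))   # 1.0 = unanimous, near 0.5 = split vote
    return np.array(preds), np.array(agreement)
```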

Lab · The forest voting: 5 trees, 1 input, different scenarios. In the default scenario every tree predicts A, so the final prediction (majority vote) is A by a unanimous 5× A vote.

Switch scenarios. When the trees agree, the vote is confident. When they split, the vote is uncertain — and that uncertainty signal is meaningful: 100 trees voting 60/40 means the forest isn’t sure. When two trees go wrong, the rest cover for them — the majority is usually right.

Three nice statistical consequences:

  • Averaging many decorrelated trees cuts variance without adding much bias, so the forest is far more stable than any single tree.
  • The vote split doubles as an uncertainty signal: a 95/5 vote is a confident prediction, a 60/40 vote is the forest telling you it isn’t sure.
  • Each tree’s bootstrap sample leaves out roughly a third of the training points, and scoring trees on the points they never saw gives a free out-of-bag error estimate, no held-out set required.
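These consequences are easy to see with scikit-learn’s RandomForestClassifier; the synthetic dataset below is only an illustration, not part of the module.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

print(rf.oob_score_)            # out-of-bag accuracy: a validation estimate with no held-out set
print(rf.predict_proba(X[:3]))  # per-class vote fractions, usable as an uncertainty signal
```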

§ 04 · WHERE IT SHINES, WHERE IT DOESN’T · Knowing when to reach for it

Random Forest’s sweet spot:

  • Tabular data with mixed numeric and categorical features, from thousands to a few million rows.
  • Strong baselines with almost no tuning: sensible defaults, no feature scaling, no GPU.
  • Built-in extras: feature importance estimates and out-of-bag error, both essentially free.

Where Random Forest struggles:

  • Raw images, audio, and text, where deep nets learn their own representations and axis-aligned splits on raw pixels or tokens do poorly.
  • Regression targets outside the training range: a forest can only average leaf values it has seen, so it cannot extrapolate.
  • Squeezing out the last points of accuracy on tabular data, where gradient boosting (XGBoost, LightGBM, CatBoost) usually wins.

CHECK · A team is choosing between a deep neural net and a Random Forest for a fraud detection model on tabular bank-transaction data (~500k rows, 30 features). What's the most likely outcome?

§ 05 · TAKING THIS FORWARD · Where to look next

The natural next step after Random Forest is gradient boosting (XGBoost, LightGBM, CatBoost). Same building block (decision tree), different ensemble strategy: instead of independent trees voting, trees are built sequentially, each correcting the errors of the previous. Usually a few points more accurate at the cost of being slower to train and tune. Both are worth keeping in the toolbox for any tabular-data problem.
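To make the contrast concrete, here is a toy sketch of the boosting idea for regression with squared loss (illustrative only, nothing like a production XGBoost implementation): each new shallow tree is fit to the residuals of the ensemble so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted(X, y, n_trees=100, learning_rate=0.1):
    """Sequential ensemble: every tree corrects what the previous ones still get wrong."""
    prediction = np.full(len(y), y.mean())                # start from the global mean
    trees = []
    for _ in range(n_trees):
        residual = y - prediction                         # current errors of the ensemble
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)     # take a small corrective step
        trees.append(tree)
    return y.mean(), trees
```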

§ · GOING DEEPER · Bagging, feature subsampling, and gradient boosting cousins

Random forests work for two reasons that compound. Bagging (Breiman 1996), bootstrap-sampling the training set per tree, reduces variance: each tree sees a slightly different dataset and makes different mistakes, so averaging cancels them. Random feature subsampling at each split decorrelates the trees further: even when one feature is dominant, most trees won’t use it at their root, so the ensemble explores more of the decision space.

The gradient-boosted cousins (XGBoost, LightGBM, CatBoost) are usually stronger than vanilla random forests on tabular data in 2026, because they fit trees sequentially to the residual of the previous ensemble — Friedman’s gradient boosting framing (2001). For tabular ML, “just use XGBoost” remains a defensible default. Random forests still shine when you want robust quick baselines, feature importance estimates, or out-of-bag evaluation without a held-out set.

§ · FURTHER READING · References & deeper sources

  1. Breiman (2001). Random Forests · Machine Learning
  2. Breiman (1996). Bagging Predictors · Machine Learning
  3. Friedman (2001). Greedy Function Approximation: A Gradient Boosting Machine · Annals of Statistics
  4. Chen, Guestrin (2016). XGBoost: A Scalable Tree Boosting System · KDD
  5. Ke et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree · NeurIPS

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.