Random Forest
One decision tree is a brittle thing. A hundred slightly-different trees, voting, is a remarkably sturdy classifier. The trick that makes the difference is randomness.
The five-bullet version
- A decision tree splits the data on one feature at a time, learning a tree of yes/no questions to reach a label.
- One tree overfits — it memorizes its training set, including the noise.
- Random Forest grows hundreds of trees, each on a bootstrap sample of the data, each only allowed to consider a random subset of features at every split.
- At inference, every tree votes; majority wins (for classification) or average (for regression).
- Still a strong baseline in 2026 — interpretable, tabular-friendly, no GPU needed (quickstart sketched below).
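The five bullets as a runnable quickstart. A minimal scikit-learn sketch on synthetic data (the dataset and hyperparameters here are illustrative, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a real tabular dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # each prediction is a majority vote over 100 trees
```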
§ 00 · DECISION TREES, IN ONE BREATH
Splitting on one feature at a time
A decision tree is a series of yes/no questions about your features. “Is age > 30? Is income > $50k?” Each question splits the data; the algorithm picks the question that best separates the classes at every step.
The tree is built top-down, greedily. At each node, try every possible split and pick the one that maximally reduces impurity (Gini or entropy). Keep going until you hit a stopping rule — max depth, minimum leaf size, or pure leaves.
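As a sketch of one greedy step, here is Gini impurity plus exhaustive split search in plain NumPy. The helper names (`gini`, `best_split`) are mine, not from any library, and numeric features are assumed:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum over classes of p_c^2.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Try every (feature, threshold) pair; keep the one that minimizes
    # the size-weighted Gini impurity of the two children.
    n, n_features = X.shape
    best = (None, None, float("inf"))
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue  # degenerate split: one child would be empty
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, t, score)
    return best  # (feature index, threshold, weighted impurity)
```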
§ 01 · WHY ONE TREE OVERFITS
Brittleness as a feature, not a bug
A single deep tree is famously brittle: move one training point and the splits change. Three problems compound:
- High variance. Small perturbations to the data cause large structural changes to the tree.
- Greedy. Picking the best split at each node doesn’t guarantee the best overall tree.
- Overfit-friendly. A deep enough tree can memorize any training set, including the noise (demonstrated in the sketch after this list).
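The memorization is easy to reproduce with scikit-learn; `flip_y` injects label noise on purpose, and a depth-unlimited tree learns it anyway (exact numbers vary by seed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 10% of labels are randomly flipped: irreducible noise.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # no depth limit
print(tree.score(X_tr, y_tr))  # ~1.0: the noise has been memorized
print(tree.score(X_te, y_te))  # noticeably lower on held-out data
```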
§ 02 · BAGGING + FEATURE RANDOMNESS
Two random tricks, stacked
Random Forest, from Leo Breiman in 2001, takes the brittle-tree problem and turns it into an advantage. Build many trees. Train each one on a slightly different version of the dataset. Restrict every split to a random handful of candidate features so the trees don’t all lean on the same ones. Then vote.
Two specific tricks:
- Bagging (bootstrap aggregating). Each tree is trained on a different bootstrap sample of the dataset: a random sample of the same size as the original, drawn with replacement. Some examples appear multiple times; some never appear, so each tree sees a slightly different slice.
- Feature subsampling at each split. When choosing the best split at a node, the tree is only allowed to consider a random subset of features (typically √(total features) for classification). This prevents every tree from picking the same dominant feature first.
Together: every tree is trained on different data, with different features available at every split. They make different mistakes. Their average is more stable than any individual.
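A from-scratch sketch of the two tricks stacked. It leans on sklearn’s tree for the per-split feature subsampling (`max_features="sqrt"`); the function name and defaults are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    # X, y are NumPy arrays. Each tree gets its own bootstrap sample and
    # is restricted to sqrt(n_features) random candidates at every split.
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: n rows, with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")
        forest.append(tree.fit(X[idx], y[idx]))
    return forest
```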
§ 03 · THE FOREST VOTES
Aggregation as denoising
At inference time, send the input to every tree. Each tree gives a prediction. For classification, take the majority. For regression, take the mean.
When the trees agree, the vote is confident. When they split, the vote is uncertain, and that uncertainty signal is meaningful: 100 trees voting 60/40 means the forest isn’t sure. When a few trees go wrong, the rest cover for them; the majority is usually right.
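A voting sketch to pair with `fit_forest` above, assuming binary 0/1 labels (multiclass needs a proper argmax over vote counts):

```python
import numpy as np

def forest_predict(forest, X):
    votes = np.stack([tree.predict(X) for tree in forest])  # (n_trees, n_samples)
    frac_ones = votes.mean(axis=0)             # fraction of trees voting class 1
    pred = (frac_ones > 0.5).astype(int)       # majority wins
    agreement = np.maximum(frac_ones, 1 - frac_ones)  # 0.5 = split vote, 1.0 = unanimous
    return pred, agreement
```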
Three nice statistical consequences:
- Variance reduction. Averaging N independent estimators reduces variance by a factor of N (when uncorrelated). Real trees aren’t fully independent, but the feature randomness brings them closer than vanilla bagging would.
- Out-of-bag estimates. Each tree was trained on a bootstrap, so ~37% of the original data was held out for each tree. Use those held-out examples to estimate test error for free — no separate validation set needed.
- Feature importance. Across all trees, sum how much each feature reduced impurity. The most useful features become obvious: a free interpretability tool (both this and the OOB estimate appear in the sketch after this list).
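Both of the last two bullets are one flag away in scikit-learn (synthetic data again, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)            # accuracy on each tree's ~37% held-out rows
print(rf.feature_importances_)  # impurity-based importances, summing to 1
```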
§ 04 · WHERE IT SHINES, WHERE IT DOESN’T
Knowing when to reach for it
Random Forest’s sweet spot:
- Tabular data. Mixed categorical and numeric features, modest dataset sizes (thousands to millions of rows). Random Forest and gradient boosting (XGBoost, LightGBM) remain the default winners on this kind of data in 2026.
- Quick baselines. Drops in for any classification or regression task with minimal tuning. Often hits 90% of the quality of a more carefully tuned model.
- Interpretability requirements. Feature importance gives you an honest answer about what the model is keying on.
Where Random Forest struggles:
- Sequential or temporal data. No innate notion of order; you can hand-engineer lag features, but it’s clumsy.
- Very high-dimensional sparse features (text in raw word-count form, image pixels). Other methods do better here.
- Extrapolation. Trees can’t predict outside the range of training values for a regression target; they’ll flatten at the boundary instead (reproduced in the sketch after this list).
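The extrapolation failure takes five lines to reproduce: train a regressor on a clean line over x in [0, 10], then ask about x = 20:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel() + 1  # linear target, observed only on x in [0, 10]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict([[5.0]]))   # ~11: interpolation is fine
print(rf.predict([[20.0]]))  # ~21, not 41: flat past the training boundary
```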
§ 05 · TAKING THIS FORWARD
Where to look next
The natural next step after Random Forest is gradient boosting (XGBoost, LightGBM, CatBoost). Same building block (decision tree), different ensemble strategy: instead of independent trees voting, trees are built sequentially, each correcting the errors of the previous. Usually a few points more accurate at the cost of being slower to train and tune. Both are worth keeping in the toolbox for any tabular-data problem.
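A hedged side-by-side using scikit-learn’s own histogram gradient booster (the gap varies by dataset; on many tabular problems boosting wins by a point or two, on others the forest holds up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
gb = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)  # sequential, residual-fitting trees
print("forest:  ", rf.score(X_te, y_te))
print("boosting:", gb.score(X_te, y_te))
```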
§ · GOING DEEPER
Bagging, feature subsampling, and gradient boosting cousins
Random forests work for two reasons that compound. Bagging (Breiman 1996) — bootstrap sampling the training set per tree — reduces variance: each tree sees a slightly different dataset and makes different mistakes, so averaging cancels them. Random feature subsampling at each split decorrelates trees further: even when one feature is dominant, most trees won’t use it at their root, so the ensemble explores more of the decision space.
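The variance claim can be made precise with a standard ensemble identity. For N trees with individual variance σ² and average pairwise correlation ρ, the ensemble mean has variance:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} f_i(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}
```

As N grows the second term vanishes and ρσ² remains, which is why feature subsampling, which lowers ρ, buys more than simply adding trees.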
The gradient-boosted cousins (XGBoost, LightGBM, CatBoost) are usually stronger than vanilla random forests on tabular data in 2026, because they fit trees sequentially to the residual of the previous ensemble — Friedman’s gradient boosting framing (2001). For tabular ML, “just use XGBoost” remains a defensible default. Random forests still shine when you want robust quick baselines, feature importance estimates, or out-of-bag evaluation without a held-out set.
§ · FURTHER READING
References & deeper sources
- Breiman, L. (2001). Random Forests · Machine Learning
- Breiman, L. (1996). Bagging Predictors · Machine Learning
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine · Annals of Statistics
- Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System · KDD
- Ke, G. et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree · NeurIPS