Ensembles - Weak Learners That Combine Into Something Stronger
Helpful context:
- Decision Trees - Splitting the Data Until the Answer Emerges
- Gradient Descent - Follow the Slope Until the Ground Levels Off
Imagine asking a hundred people who each have mediocre knowledge of wine to each guess the vintage of a bottle. No single person gets it right, but if their errors are independent and roughly symmetric around the true answer, the average guess is surprisingly accurate. This is not a fluke. It is a mathematical consequence of variance canceling when you average independent estimates. The same principle applies to machine learning models: a collection of weak, individually unreliable models can combine into an ensemble that outperforms any single strong model. This intuition has been formalized into some of the most practically effective algorithms in all of machine learning.
The formal framework is the bias-variance decomposition. For a regression model $\hat{f}$ trained on a dataset $D$, the expected squared error at a point $x$ decomposes as:
$$\mathbb{E}_D[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
where the bias measures systematic error (does the model get the right answer on average?), the variance measures sensitivity to the training set (does the model change dramatically with different data?), and $\sigma^2$ is irreducible noise. Deep decision trees have low bias but high variance: they fit training data nearly perfectly but change wildly with different samples. Ensemble methods reduce variance by aggregating many such trees. Boosting reduces bias by sequentially correcting errors. The two main families, bagging and boosting, attack different parts of this decomposition.
The practical payoff has been enormous. Random Forests and Gradient Boosted Trees (especially XGBoost) are consistently among the top performers on tabular data benchmarks, competitive with or superior to deep neural networks on structured data where the sample size is in the thousands to low millions. Understanding why they work so well requires tracing through the mechanics carefully.
Bagging: Variance Reduction through Bootstrap Aggregation
Bagging (Bootstrap AGGregating, Breiman 1996) is the simplest ensemble method. The idea is to train $T$ models independently on different random subsets of the training data, then average their predictions.
The subsets are drawn by bootstrap sampling: for a training set of $n$ examples, each bootstrap sample draws $n$ examples uniformly at random with replacement. On average, each bootstrap sample contains about 63% of the unique training examples (since $(1 - 1/n)^n \to 1/e \approx 0.368$ of examples are excluded from any given sample). The excluded examples, called the out-of-bag (OOB) sample, can be used to estimate generalization error without a separate validation set.
For regression, the ensemble prediction is the average:
$$\hat{f}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{f}_t(x)$$
For classification, the ensemble typically uses majority vote across the $T$ classifiers.
Why does averaging reduce variance? If $T$ models each have variance $\sigma^2$ and their errors are uncorrelated, the average has variance $\sigma^2 / T$. The variance shrinks to zero as $T \to \infty$. In practice, the models are trained on overlapping bootstrap samples and their predictions are positively correlated, so the actual reduction is:
$$\text{Var}\left[\frac{1}{T}\sum_t \hat{f}_t\right] = \rho \sigma^2 + \frac{1-\rho}{T}\sigma^2$$
where $\rho$ is the average pairwise correlation between model predictions. As $T \to \infty$, the second term vanishes, but the first term $\rho \sigma^2$ remains. High correlation between models sets a floor on ensemble variance. This is why de-correlating the base models matters, and it motivates the key innovation in Random Forests.
Random Forests: De-correlating the Trees
Random Forests (Breiman, 2001) are bagged decision trees with one critical modification: at each node during tree construction, only a random subset of $m$ features is considered as candidates for splitting, rather than all $p$ features.
Typically $m = \lfloor\sqrt{p}\rfloor$ for classification and $m = \lfloor p/3 \rfloor$ for regression. The choice of $m$ is the primary hyperparameter of a Random Forest.
Why does this help? Consider a dataset where one feature is strongly predictive and the others are weaker. In a standard bagged forest, every tree would put the strong feature near its root on almost every bootstrap sample, producing highly correlated trees. The ensemble variance floor $\rho \sigma^2$ would remain high because $\rho$ is close to 1. By excluding the dominant feature from consideration at many nodes with probability $1 - m/p$, feature subsampling forces different trees to rely on different features, reducing $\rho$ and lowering the variance floor.
Training a Random Forest is straightforwardly parallelizable: each tree is trained independently on its bootstrap sample. Prediction requires running the input through all $T$ trees and aggregating, which is also embarrassingly parallel. This makes Random Forests practical at scale.
Out-of-bag error. For each training example, the trees that did not include it in their bootstrap sample form a natural holdout ensemble. Averaging only those trees' predictions for that example gives the OOB prediction. The OOB error is an unbiased estimate of generalization error and typically tracks cross-validation error closely, without requiring a separate validation set or multiple training runs.
AdaBoost: Reweighting for the Mistakes
Boosting takes a fundamentally different approach. Rather than building independent models in parallel and averaging, boosting builds models sequentially, with each new model focusing on the examples that earlier models got wrong.
AdaBoost (Freund and Schapire, 1995) maintains a weight $w_i$ for each training example $i$, initialized uniformly to $1/n$. At each round $t$:
- Train a weak classifier $h_t$ on the weighted training set (examples with higher $w_i$ are more influential).
- Compute the weighted error of $h_t$: $\epsilon_t = \sum_i w_i \cdot \mathbf{1}[h_t(x_i) \neq y_i]$.
- Compute the classifier weight: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.
- Update the example weights: increase $w_i$ for misclassified examples by a factor of $e^{\alpha_t}$, decrease $w_i$ for correctly classified examples by $e^{-\alpha_t}$, then renormalize.
The final prediction is the weighted vote:
$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$
The effect is that after each round, misclassified examples receive higher weight, so the next weak learner is forced to pay more attention to the hard cases. Each $\alpha_t$ reflects the reliability of that round’s classifier: a classifier barely better than random (small $\alpha_t$) gets low weight; a very accurate classifier (large $\alpha_t$) gets high weight.
A key theoretical result: if each weak learner has training error at most $0.5 - \gamma$ (i.e., better than random by a margin $\gamma$), then the training error of the ensemble decreases exponentially as $T$ increases, at rate $e^{-2\gamma^2 T}$. Boosting can drive training error to zero quickly. Generalization is governed by a separate analysis using the margin theory.
Gradient Boosting: Fitting Residuals as Gradient Descent
AdaBoost has an elegant interpretation: it performs gradient descent in function space. Gradient Boosting (Friedman, 2001) makes this interpretation explicit and generalizes it to arbitrary differentiable loss functions.
Let $F(x)$ be the ensemble model and $\mathcal{L}(y, F(x))$ be a loss function. Gradient Boosting builds $F$ additively:
$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$
where $\eta$ is a learning rate (shrinkage) and $h_m$ is the $m$-th base learner, a small decision tree. The key question is what $h_m$ should approximate. The answer is: the negative gradient of the loss evaluated at the current ensemble predictions.
Define the pseudo-residuals at step $m$:
$$r_i^{(m)} = -\left.\frac{\partial \mathcal{L}(y_i, F(x_i))}{\partial F(x_i)}\right|{F = F{m-1}}$$
The new tree $h_m$ is fitted to these pseudo-residuals: minimize $\sum_i (r_i^{(m)} - h_m(x_i))^2$. The tree is then added to the ensemble, and the process repeats.
For squared-error loss $\mathcal{L}(y, F) = \frac{1}{2}(y - F)^2$, the negative gradient is simply $r_i = y_i - F_{m-1}(x_i)$, the ordinary residual. This is exactly residual fitting: each tree corrects the errors of the current ensemble. For other loss functions, the pseudo-residuals generalize this notion. For log-loss (binary classification), the gradient involves the probability estimates of the current model, automatically downweighting examples the model is already confident about. For absolute loss $\mathcal{L}(y, F) = |y - F|$, the pseudo-residuals are signs, producing a robust form of boosting less sensitive to outliers than squared-error boosting.
The learning rate $\eta$ is a shrinkage parameter that scales each tree’s contribution. Smaller $\eta$ means more trees are needed but typically gives better generalization: slower learning lets each tree correct only a small portion of the current error, preventing any single tree from dominating. The practical recommendation is to use small $\eta$ (0.01 to 0.1) and early stopping on a validation set to choose $T$.
XGBoost: Second-Order Updates and Regularization
Vanilla Gradient Boosting uses only the first-order gradient (the pseudo-residuals) to determine what each tree should fit. XGBoost (Chen and Guestrin, 2016) uses the second derivative of the loss as well, making it a Newton-step method in function space.
At step $m$, define for each example:
$$g_i = \frac{\partial \mathcal{L}(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}, \qquad h_i = \frac{\partial^2 \mathcal{L}(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)^2}$$
The second-order approximation of the objective with respect to the new tree’s outputs gives a closed-form optimal leaf weight for each leaf $j$:
$$w_j^* = -\frac{\sum_{i \in \text{leaf}j} g_i}{\sum{i \in \text{leaf}_j} h_i + \lambda}$$
where $\lambda$ is an L2 regularization term on leaf weights. The quality score for a tree structure is also derived in closed form, so XGBoost can evaluate candidate splits analytically. An L1 penalty $\alpha$ on leaf weights further promotes sparsity.
Additional improvements in XGBoost over standard GBM:
- Column (feature) subsampling: like Random Forests, XGBoost can subsample features at each split, reducing overfitting and improving speed.
- Approximate quantile sketching: rather than sorting all data to find every possible split threshold, XGBoost uses weighted quantile sketches to find candidate thresholds in approximate but efficient fashion.
- Sparsity awareness: XGBoost learns a default direction for missing values at each split, handling sparse features efficiently without imputation.
- Cache-aware block structure: data is stored in compressed block format optimized for cache access patterns during tree construction.
These improvements together make XGBoost substantially faster and often more accurate than vanilla GBM, particularly on large datasets.
LightGBM: The Speed Champion for Tabular Data
XGBoost made gradient boosting practical at scale. LightGBM (Ke et al., Microsoft, 2017) pushed speed further by rethinking how trees are built. The result is often 10-20x faster than XGBoost with comparable or better accuracy - which is why LightGBM dominates Kaggle tabular competitions and industrial deployments.
Two core algorithmic innovations drive the speedup:
Leaf-wise (best-first) tree growth: standard gradient boosting grows trees level by level - all nodes at depth $d$ are split before any node at depth $d+1$. LightGBM instead always splits the leaf with the largest loss reduction, regardless of depth. A single path through the tree may grow very deep while other branches stay shallow. This finds more complex patterns with fewer total leaves, producing better trees at the same computational budget. The risk: leaf-wise growth can overfit on small datasets because individual leaves may end up with few examples. Use num_leaves and min_data_in_leaf to control this.
GOSS (Gradient-based One-Side Sampling): in a boosted tree at step $m$, examples with large gradients $|g_i|$ are the ones the current model is getting wrong - they are the most informative for the next tree. Examples with small gradients are already well-predicted. GOSS keeps all high-gradient examples and randomly samples a fraction of low-gradient examples, then upweights the sampled low-gradient examples by a correction factor so the gradient statistics remain unbiased. This reduces the number of examples used for each split computation without hurting accuracy much, since low-gradient examples contribute little information.
EFB (Exclusive Feature Bundling): high-dimensional data often has many sparse features that are rarely non-zero simultaneously (think one-hot encoded categoricals). EFB bundles such mutually exclusive features into a single feature, reducing the effective feature count and making each split computation cheaper.
Native categorical support: LightGBM can handle categorical features directly without one-hot encoding by finding the optimal split over category values using a specialized algorithm. This is both faster and often more accurate than one-hot encoding, which can produce thousands of sparse columns.
import lightgbm as lgb
model = lgb.LGBMClassifier(
n_estimators=1000,
learning_rate=0.05,
num_leaves=31, # controls tree complexity; lower = less overfitting
min_child_samples=20, # minimum examples per leaf
subsample=0.8, # row subsampling (like XGBoost's subsample)
colsample_bytree=0.8, # feature subsampling per tree
reg_alpha=0.1, # L1 regularization
reg_lambda=1.0, # L2 regularization
)
model.fit(
X_train, y_train,
eval_set=[(X_val, y_val)],
callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
Pros:
- 10-20x faster than XGBoost on large datasets.
- Lower memory: histogram-based splits mean LightGBM bins continuous features upfront rather than sorting full arrays at each split.
- Handles large datasets natively without approximate sketches.
- Native categorical support avoids preprocessing overhead.
Cons:
- Leaf-wise growth is more prone to overfitting on small datasets (fewer than a few thousand examples). Use XGBoost or Random Forest there.
num_leavesis the most important hyperparameter and its right value varies widely across datasets. Start withnum_leaves=31and tune.- The speed makes it easy to overfit by running too many iterations - always use early stopping.
Choosing between XGBoost and LightGBM: for large datasets (100K+ rows), default to LightGBM for speed. For small datasets or when you need maximum control over overfitting, XGBoost is safer. Both should be in your toolkit - they often produce diverse predictions that improve when stacked or ensembled together.
Feature Importance
Tree ensembles provide natural feature importance estimates. The two main approaches often disagree, and understanding why is important.
Mean decrease impurity (MDI) is computed during training. For each feature, sum the impurity reduction it causes across all splits in all trees, weighted by the number of training examples passing through each split. This is fast, always available, and interpretable as “how much did this feature help build the trees?”
The problem with MDI is that it is biased toward high-cardinality features. A continuous feature with many possible split thresholds has more opportunities to appear in splits and accumulates more impurity reduction by chance. A feature with only two distinct values can appear in at most one split per path, while a continuous feature can appear at every level. MDI overestimates the importance of continuous and high-cardinality categorical features.
Permutation importance avoids this bias. After training, for each feature $j$: randomly shuffle the values of feature $j$ in the validation set (breaking its relationship with the target), run the shuffled data through the trained ensemble, and measure the increase in validation error compared to the unshuffled baseline. A large increase means the model relied heavily on feature $j$; a small increase means feature $j$ was not important for validation performance.
Permutation importance is unbiased with respect to cardinality and measures what matters for generalization. Its disadvantage is that it requires a held-out dataset and is slower to compute. When MDI and permutation importance disagree, trust permutation importance.
Stacking: Using Predictions as Features
Stacking (Wolpert, 1992) is a more general ensemble technique. Rather than averaging predictions, you train a meta-learner whose inputs are the predictions of the base models.
The procedure: train $T$ base models on the training data. Use cross-validation to generate out-of-fold predictions from each base model for every training example (to avoid overfitting: each training example’s prediction comes from a model that did not see that example). Stack these $T$ prediction columns as features for the meta-learner, and train the meta-learner on the original training labels. At test time, run the test data through all $T$ base models, stack their predictions, and feed the result to the meta-learner.
The meta-learner learns how to combine the base models: it discovers that certain base models are more reliable on certain regions of input space, or that some models' errors are correlated while others are independent. In Kaggle competitions, stacking 5-10 diverse base models (Random Forest, GBM, logistic regression, neural net, k-NN) with a simple linear or logistic regression meta-learner consistently improves over any single base model.
When Ensembles Fail
Ensembles reduce variance but cannot reduce bias. If all base models share a systematic error, averaging them amplifies, not cancels, that error. The most common failure mode is shared data bias: if all models are trained on the same biased dataset (e.g., training data that underrepresents certain demographic groups), the ensemble inherits and can entrench that bias. More models does not help when the problem is in the data.
A second failure mode is feature leakage: if the training data contains a feature that encodes the target (e.g., a timestamp that correlates with outcome due to temporal trend), all trees in the ensemble will use it, and ensemble importance metrics will correctly flag it as the most important feature. But on new data where the leakage feature is unavailable or uncorrelated, the ensemble will fail. Ensembles are not a substitute for careful data validation.
Third: distributional shift. Ensembles are excellent at interpolating within the training distribution. They are not necessarily better than single models at extrapolating to out-of-distribution inputs.
Summary
| Method | Strategy | Reduces | Key hyperparameters |
|---|---|---|---|
| Bagging | Bootstrap samples, average predictions | Variance | Number of trees $T$ |
| Random Forest | Bagging + feature subsampling at each split | Variance, decorrelates trees | $T$, $m$ (features per split) |
| AdaBoost | Sequential, reweight misclassified examples | Bias (weak learners toward hard examples) | $T$, weak learner depth |
| GBM | Sequential, fit trees to negative gradient of loss | Bias, flexible loss | $T$, $\eta$, tree depth |
| XGBoost | GBM + second-order updates + regularization + approx. quantile splits | Bias + overfitting | $T$, $\eta$, $\lambda$, $\alpha$, column subsample |
Read next: