Decision Trees - Splitting the Data Until the Answer Emerges // Megha Bose

Helpful context:

Suppose you are a doctor trying to decide whether a patient has a certain disease. You do not run a single blood test and call it done. You ask questions in sequence: Is the patient over 50? If yes, do they smoke? If yes, do they have a persistent cough? Each answer narrows the population you are reasoning about, until you reach a conclusion with enough confidence to act on. This is not a statistical formula; it is a tree of decisions. And it turns out that formalizing this process precisely, with the right choice of which question to ask at each step, gives one of the most interpretable and surprisingly powerful models in all of machine learning.

A decision tree partitions the feature space recursively. You start with all your training examples at the root. You pick one feature and one threshold, and split the examples into two groups: those where the feature is below the threshold go left, those above go right. You then repeat the process independently on each group, asking a new question on a new feature. You keep splitting until some stopping criterion is met, at which point each terminal node, called a leaf, assigns a label: the majority class among all training examples that ended up there. The resulting model is a binary tree whose internal nodes each test one feature, whose branches encode the outcome of that test, and whose leaves emit predictions. Crucially, the decision boundary this tree defines in feature space is piecewise axis-aligned: every split is parallel to one of the coordinate axes. The boundary between predicted classes is made of horizontal and vertical lines in 2D, or hyperplane slabs parallel to coordinate hyperplanes in higher dimensions.

Why does this structure matter? Because it is transparent. You can print a small tree and a human can read it. You can trace every prediction back through the sequence of tests that produced it. The model handles a mix of categorical and numerical features naturally, does not require any feature scaling or normalization, and is robust to irrelevant features, which simply never appear in the tree if they carry no useful signal. These advantages come at a cost, which we will address: trees overfit aggressively when grown deep, and their axis-aligned boundaries struggle with relationships that are fundamentally diagonal in feature space. But they are the essential primitive on which the most powerful ensemble methods in practice, Random Forests and Gradient Boosting, are built.

The Splitting Criterion: Entropy and Information Gain

The central algorithmic question in building a decision tree is: at a given node containing a set $S$ of training examples, which feature should we split on, and at what threshold?

We need a measure of the quality of a set of examples. If all examples in $S$ belong to the same class, the node is pure and we need ask no further questions. If the examples are evenly divided among many classes, the node is maximally impure and a split would help. The right measure comes from information theory.

Define the entropy of a set $S$ as:

$$H(S) = -\sum_{c} p_c \log_2 p_c$$

where $p_c$ is the proportion of examples in $S$ belonging to class $c$, and the sum runs over all classes present. Entropy is zero when $S$ is pure (all one class, so one $p_c = 1$ and all others are 0, and $\log_2 1 = 0$), and is maximized when all classes are equally represented. For a binary classification problem with classes positive and negative, entropy is $H = -p \log_2 p - (1-p)\log_2(1-p)$, which peaks at 1 bit when $p = 0.5$ and falls to 0 at $p = 0$ or $p = 1$.

Given a candidate split on feature $A$ that divides $S$ into subsets $\{S_v\}$ indexed by feature value $v$, the information gain of the split is:

$$\text{IG}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$$

The first term is the entropy before the split. The second term is the weighted average entropy of the resulting subsets, weighted by the fraction of examples in each subset. Information gain measures how much the split reduces uncertainty. We choose the feature $A$ with the highest information gain. This criterion is the basis of the ID3 algorithm and its successor C4.5.

A worked example. Suppose we have 10 training examples at a node: 6 belong to class A and 4 to class B. The entropy before any split is:

$$H(S) = -\frac{6}{10}\log_2\frac{6}{10} - \frac{4}{10}\log_2\frac{4}{10} = -0.6 \cdot (-0.737) - 0.4 \cdot (-1.322) \approx 0.971 \text{ bits}$$

Now consider a feature that splits $S$ into a left child $S_L$ with 5 examples (4 class A, 1 class B) and a right child $S_R$ with 5 examples (2 class A, 3 class B). The entropies of the children are:

$$H(S_L) = -\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5} \approx 0.722 \text{ bits}$$

$$H(S_R) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971 \text{ bits}$$

The weighted child entropy is $\frac{5}{10}(0.722) + \frac{5}{10}(0.971) = 0.847$ bits. So the information gain of this split is $0.971 - 0.847 = 0.124$ bits. If a different feature produced a split with left child being all class A (4 examples) and right child being 2 A and 4 B, the left child would have entropy 0 and the right child would have entropy $-\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6} \approx 0.918$ bits. The weighted child entropy would be $\frac{4}{10}(0) + \frac{6}{10}(0.918) = 0.551$ bits, and the information gain would be $0.971 - 0.551 = 0.420$ bits. We would choose the second feature.

Limitation of information gain. Information gain has a known bias: it tends to favor features with many distinct values. A feature that gives every example its own unique value (like a customer ID) would split $S$ into $n$ singleton subsets, each with entropy 0, yielding maximum gain – but the resulting tree would be useless on new data. The C4.5 algorithm corrects for this with the gain ratio, which divides information gain by the intrinsic information of the split (the entropy of the split sizes themselves), penalizing splits that fragment the data into many small pieces.

The Splitting Criterion: Gini Impurity

The CART algorithm (Classification and Regression Trees) uses a different impurity measure called Gini impurity:

$$\text{Gini}(S) = 1 - \sum_{c} p_c^2$$

Gini impurity has a natural probabilistic interpretation: it is the probability that a randomly chosen example would be misclassified if it were labeled according to the class distribution of $S$. When $S$ is pure, $p_c = 1$ for one class and the Gini is $1 - 1 = 0$. For a binary problem with a 50/50 split, Gini is $1 - (0.5^2 + 0.5^2) = 0.5$, its maximum.

The CART algorithm chooses the split that minimizes the weighted Gini impurity of the two children:

$$\text{Gini split cost} = \frac{|S_L|}{|S|}\text{Gini}(S_L) + \frac{|S_R|}{|S|}\text{Gini}(S_R)$$

The same worked example under Gini. With 6 class A and 4 class B in $S$, the parent Gini is $1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 0.48$. For the first split, $S_L$ has 4 class A and 1 class B: $\text{Gini}(S_L) = 1 - (0.8^2 + 0.2^2) = 1 - 0.68 = 0.32$. And $S_R$ has 2 class A and 3 class B: $\text{Gini}(S_R) = 1 - (0.4^2 + 0.6^2) = 1 - 0.52 = 0.48$. The weighted Gini is $0.5(0.32) + 0.5(0.48) = 0.40$. For the second split, $S_L$ has 4 class A and 0 class B: $\text{Gini}(S_L) = 0$. And $S_R$ has 2 class A and 4 class B: $\text{Gini}(S_R) = 1 - \bigl((1/3)^2 + (2/3)^2\bigr) = 1 - (1/9 + 4/9) = 4/9 \approx 0.444$. The weighted Gini is $\frac{4}{10}(0) + \frac{6}{10}(0.444) = 0.267$. The second split wins here too, consistent with information gain.

Gini and entropy almost always agree on which split to choose. The practical advantage of Gini is computational: it avoids computing logarithms. In practice, CART with Gini is the default in most implementations, including scikit-learn.

For regression trees, the impurity measure is variance: the cost of a node $S$ with real-valued targets $y_1, \ldots, y_{|S|}$ is their variance $\sigma^2(S) = \frac{1}{|S|}\sum_i (y_i - \bar{y})^2$, and we choose the split minimizing the weighted variance of the children. A leaf in a regression tree predicts the mean $\bar{y}$ of the targets that fall into it.

Building a Tree: The CART Algorithm

CART builds a tree top-down using a greedy procedure. At each node, it considers every feature and, for each feature, every possible threshold, meaning every midpoint between consecutive distinct values of that feature in the current node’s training set. It evaluates the Gini cost (or variance reduction) for each (feature, threshold) pair and chooses the one with the minimum weighted child impurity. It then recurses independently on the left and right children.

Running time. Sorting $n$ examples on one feature takes $O(n \log n)$. Evaluating all thresholds for one feature after sorting takes $O(n)$. Doing this for all $d$ features at one node takes $O(dn \log n)$. At each level of the tree, all nodes together cover all $n$ examples, so the cost per level is also $O(dn \log n)$. For a tree of depth $\ell$, the total cost is $O(dn\ell \log n)$. For a balanced tree of depth $\log n$, this becomes $O(dn \log^2 n)$.

Stopping criteria. CART stops splitting a node when any of the following hold: the node is pure (all examples in $S$ belong to one class); the number of examples in the node is smaller than a minimum threshold min_samples_split; the maximum allowed depth has been reached; or the best available split produces no impurity reduction. Without any stopping criterion, CART will grow a tree until every leaf is pure, which on training data is always achievable (assuming no two examples have identical features but different labels).

Handling categorical features. For a feature with $k$ categorical values, CART in principle considers all $2^{k-1} - 1$ ways to partition the values into two groups. For binary classification, there is an efficient shortcut: sort the categorical values by their proportion of positive examples, then find the best threshold in this sorted order. For multi-class problems or regression, the full search may be necessary for small $k$ but is typically limited or approximated for large $k$.

Leaf Labeling and Soft Predictions

When the tree stops growing, each leaf contains a subset of training examples. In a classification tree, the leaf is labeled with the majority class among those examples. Any new example that falls into that leaf receives that majority class as its prediction.

This hard prediction can be augmented with a soft prediction: instead of emitting a single class label, the leaf emits a probability vector, where the probability for class $c$ is simply the fraction of training examples in the leaf that belong to class $c$:

$$p(c \mid \text{leaf}) = \frac{\text{count of class } c \text{ in leaf}}{\text{total examples in leaf}}$$

These probabilities can be calibrated (they tend to be overconfident for deep trees that have few examples per leaf) but they provide a meaningful uncertainty estimate. A leaf with 49 class A examples and 1 class B example predicts class A but with soft probability 0.98 rather than 1.0, reflecting the small residual uncertainty.

In a regression tree, the leaf predicts the mean of the target values among its training examples. The leaf can also emit a prediction interval using the standard deviation of those values.

Overfitting and Pruning

A fully grown tree, one grown until every leaf is pure, has zero training error by construction. On any finite training set, you can always achieve this, because eventually the recursive splitting isolates every training example into its own leaf. But a tree that has memorized every training example has high variance: small changes to the training set produce very different trees. This is the classic overfitting problem, and it is especially severe for trees because the model is highly expressive and the greedy top-down construction has no global optimization signal.

There are two families of solutions.

Pre-pruning (early stopping) addresses overfitting by refusing to make a split in the first place. You stop splitting a node if: the best available information gain is below a threshold $\epsilon$; the node contains fewer than min_samples examples; the depth has reached a maximum. The disadvantage of pre-pruning is that it is myopic. A split that looks useless in isolation might unlock very useful splits below it; by stopping early you may miss structure that only becomes visible two or three levels down.

Post-pruning takes the opposite approach: grow the full tree first, then work backwards from the leaves. The most principled form is cost-complexity pruning, also called weakest-link pruning, which is what CART uses. Define the regularized cost of a tree $T$:

$$C_\alpha(T) = \text{Training Error}(T) + \alpha \cdot |T|$$

where $|T|$ is the number of leaves in $T$ and $\alpha \geq 0$ is a complexity penalty. When $\alpha = 0$ this is just training error, minimized by the fully grown tree. As $\alpha$ increases, simpler trees become preferable. For each value of $\alpha$, there is a unique subtree of the full tree that minimizes $C_\alpha$, and this subtree can be found efficiently. You sweep over a sequence of $\alpha$ values, obtain the corresponding subtrees, and choose among them by cross-validation on a held-out set.

Reduced error pruning is a simpler post-pruning strategy. Set aside a validation set. For each internal node in the fully grown tree, consider replacing the entire subtree rooted at that node with a single leaf (labeled with the majority class of the training examples at that node). Compute validation error before and after this replacement. If validation error does not increase, make the replacement permanent. Repeat, bottom-up, until no more replacements are beneficial. This is greedy and uses only validation error, making it fast but less principled than cost-complexity pruning. Empirically, both methods produce trees of similar quality when the validation set is large enough.

Properties, Strengths, and Weaknesses

Strengths. Decision trees require no preprocessing: no feature scaling, no normalization, no encoding of ordinal relationships. They handle mixed feature types (continuous, ordinal, nominal) with equal ease. They are interpretable: you can print the tree and a domain expert can validate each splitting rule. They provide implicit feature importance: features that appear near the root are more important because their splits affect the largest sets of examples, and you can quantify this by summing the impurity reduction attributable to each feature across all nodes where it is used.

Weaknesses. Trees are highly unstable. A small change to the training data can produce a completely different tree structure, because the greedy splitting is sensitive to small perturbations at each node. This instability is a signature of high variance. Trees also struggle with diagonal boundaries: if the true decision boundary is $x_1 + x_2 = 0$, an axis-aligned tree needs many splits (a staircase approximation) to represent it, and the approximation is never exact. The bias-variance tradeoff is controlled primarily by tree depth: shallow trees have high bias (too coarse), deep trees have high variance (too specific to training data).

Feature importance. For a feature $j$, its importance is computed as the total impurity reduction it causes across all nodes in the tree where it is the splitting feature:

$$\text{Importance}(j) = \sum_{\text{nodes } t \text{ splitting on } j} \frac{|S_t|}{n} \bigl[\text{Impurity}(S_t) - \frac{|S_{t,L}|}{|S_t|}\text{Impurity}(S_{t,L}) - \frac{|S_{t,R}|}{|S_t|}\text{Impurity}(S_{t,R})\bigr]$$

This is normalized so that importances sum to 1 across all features. A feature that reduces impurity a lot on many examples scores highly; a feature that never appears in any split scores zero.

The solution to instability. Rather than using a single tree, build many trees on perturbed versions of the training data and average their predictions. This is ensemble learning. Random Forests reduce variance by combining hundreds of independently grown trees, each trained on a bootstrap sample and each using a random subset of features at each split. Gradient Boosting builds trees sequentially, with each tree correcting the errors of the ensemble so far. Both methods inherit the interpretability of individual trees (through aggregated feature importances) while dramatically improving generalization.

Decision Trees as Rule Systems

Every path from the root to a leaf in a decision tree corresponds to a conjunctive rule. Consider a tree with two internal nodes: the root splits on $x_1 < 5$, and the left child splits on $x_2 > 2$. The left-left leaf (reached when $x_1 < 5$ and $x_2 > 2$) corresponds to the rule:

$$\text{if } x_1 < 5 \text{ AND } x_2 > 2 \text{ then class A}$$

More generally, any leaf at depth $k$ is reached by a conjunction of $k$ conditions, one per edge on the path from the root. The complete set of leaf-rules is mutually exclusive (a new example falls into exactly one leaf) and exhaustive (every example falls into some leaf). This means the decision tree is logically equivalent to a set of rules that collectively cover all of input space without overlap.

This rule representation has a direct use in knowledge distillation. If you have a complex black-box model, you can approximate it by generating many training examples, labeling them with the black-box model’s predictions, fitting a shallow decision tree to those labeled examples, and extracting the rules. The result is a human-readable approximation to the black-box. The fidelity depends on how many examples you generate and how well the tree captures the model’s decision boundaries; for complex models, a single tree approximation will be crude, but it can reveal dominant patterns that a domain expert can validate or act on.

This equivalence between trees and rules also motivates rule-learning algorithms that search the rule space directly rather than building trees. But because the tree construction is computationally tractable (greedy, polynomial time) while exhaustive rule search is not, decision trees remain the standard approach to interpretable rule-based classification.

Read Next: