Decision Trees in Machine Learning
Definition
A decision tree is a machine learning technique with a tree structure that branches according to the features of the input data and ultimately determines an output value.
Explanation
Decision trees are among the classical machine learning methods: intuitive, easy to understand, and accordingly applied in a wide range of settings. If k-means clustering is the representative classical method of unsupervised learning, it is no exaggeration to say that decision trees occupy a comparable place in supervised learning.
Prime-identification problem
To understand how decision-making occurs in a tree structure, imagine the problem of “identifying primes among natural numbers from 1 to 10.” A decision tree splits the data according to the question at each node and ultimately makes a decision about “whether a natural number is prime” based on the tree.

- ($A$): This node is the root of the tree. It asks whether the number is of the form $3n-1$. $2, 5, 8$ fall into this category.
- ($B_{1}$): Asks whether the number is less than $7$. $2, 5$ are both prime and $8$ is composite, so both branches below this node end in leaf nodes.
- ($B_{2}$): Asks whether the number is of the form $2^{n} - 1$. $1, 3, 7$ belong here; all other cases are composite.
- ($C$): Asks whether the number is greater than $2$. $3, 7$ are both prime and $1$ is not, so both branches below this node end in leaf nodes.
According to this structure, if we simply feed $x$ into this tree, we can decide whether $x$ is prime, as the sketch below illustrates. However, as this example shows, decision trees are vulnerable to overfitting: although the tree achieves 100% accuracy on the given data $1, \cdots , 10$, $x = 11$ is immediately judged composite at node ($B_{1}$). The decision tree failed to discover the true rule for primality; it focused on superficial performance rather than the substantive features of the data, and so it lacks generalized inferential capability.
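To make the decision process concrete, here is a minimal Python sketch of the tree above written as nested conditionals (the function name `is_prime_by_tree` is ours, chosen for illustration). It reproduces the 100% accuracy on $1, \cdots , 10$ and the failure at $x = 11$ described above.

```python
def is_prime_by_tree(x: int) -> bool:
    """Follow the hand-built tree: A -> B1, or A -> B2 -> C."""
    if (x + 1) % 3 == 0:          # node A: is x of the form 3n - 1?
        return x < 7              # node B1: 2, 5 -> prime; 8 -> composite
    # node B2: is x of the form 2^n - 1, i.e. is x + 1 a power of two?
    if (x + 1) & x == 0:          # bit trick: true exactly when x + 1 is a power of two
        return x > 2              # node C: 3, 7 -> prime; 1 -> not prime
    return False                  # everything else -> composite

# 100% accuracy on the training data 1..10 ...
print([x for x in range(1, 11) if is_prime_by_tree(x)])  # [2, 3, 5, 7]
# ... but the very next prime is already misclassified at node B1.
print(is_prime_by_tree(11))                              # False
```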
CART algorithm
Of course, one does not construct trees by hand as in the example above; in practice the tree is built by the CART (Classification and Regression Trees) algorithm (Breiman et al., 2017). Fundamentally, CART adopts a strategy reminiscent of binary search: it repeatedly splits the data on a selected feature into two parts, choosing the split points that most improve performance.
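In practice a library implementation is typically used; for instance, scikit-learn's `DecisionTreeClassifier` is a CART-style learner. A minimal usage sketch on the prime-identification data above (the numeric encoding of the labels is our own choice):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.arange(1, 11).reshape(-1, 1)           # the natural numbers 1..10 as a single feature
y = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 0])  # 1 = prime, 0 = not prime

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x"]))  # the binary splits CART actually chose
print(tree.predict([[11]]))                    # 11 is prime, but the learned tree may well say otherwise
```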
Classification problems
For classification problems, splits are chosen to minimize an impurity measure such as the Gini impurity or entropy. The Gini impurity, for example, is $$ G \left( D \right) = 1 - \sum_{k=1}^{C} p_{k}^{2} = \sum_{k=1}^{C} p_{k} \left( 1 - p_{k} \right) $$ where $p_{k}$ is the proportion of class $k$ in the data $D$ at a node. The purer the data classified at a node, the smaller the Gini impurity becomes, which is interpreted as the decision tree classifying the data well.
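As a sketch of how this criterion drives a single split, the code below computes the Gini impurity from class proportions and scans candidate thresholds on one feature, keeping the threshold with the lowest weighted impurity of the two children (the function names `gini` and `best_split` are ours, for illustration only):

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """G(D) = 1 - sum_k p_k^2, computed from class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x: np.ndarray, y: np.ndarray):
    """Scan midpoints between sorted feature values; return the threshold
    minimizing the weighted Gini impurity of the two child nodes."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_g = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Example: the first CART-style split for "prime or not" on 1..10.
x = np.arange(1, 11, dtype=float)
y = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 0])
print(best_split(x, y))
```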
We must again address overfitting: as seen in the prime-identification example, a decision tree can create cases that apply only to instances that appear exactly once in the data; consequently, for many datasets it is possible to increase splits arbitrarily to force-fit the data. A low Gini impurity means that the classification result has been partitioned into many pure subsets, but one should be careful not to mistake that for genuinely understanding the data.
Regression problems

Although decision trees are inherently predisposed toward classification, the CART algorithm, true to its name, can be applied to both classification and regression problems. For regression problems, the feature space is partitioned and the mean of the dependent variable within each region is used as the prediction; the loss function is the mean squared error. In the figure, the feature $x$ is split into three regions depending on whether it is less than $x_{1}$, greater than $x_{2}$, or in between, and the mean indicated in orange is used as the prediction in each region.
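A minimal sketch of the regression case, assuming the two cut points $x_{1}, x_{2}$ are already fixed: each region predicts the mean of its training targets, and the mean squared error is the quantity a CART-style learner would try to reduce when placing those cuts. The data and cut points below are synthetic, made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + rng.normal(0, 0.2, 50)   # a noisy nonlinear relationship (synthetic)

x1, x2 = 3.0, 7.0                        # illustrative cut points (the x_1, x_2 of the text)
regions = [x < x1, (x >= x1) & (x < x2), x >= x2]
means = [y[r].mean() for r in regions]   # each region's prediction is its own mean

y_hat = np.empty_like(y)
for region, m in zip(regions, means):
    y_hat[region] = m                    # piecewise-constant prediction

mse = np.mean((y - y_hat) ** 2)          # the loss CART minimizes when choosing the cuts
print(means, mse)
```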
Yet again, the risk of overfitting lurks. In the figure the decision tree appears to capture a nonlinear relationship by splitting the data into three regions, but one can always subdivide the intervals further to drive the training loss lower still.

Such forced fitting can be carried to extremes, so the model’s performance must be continuously scrutinized. To use decision trees well, the first, second, and third most important considerations are all vigilance against overfitting.
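One way to keep that scrutiny concrete is to watch training error and held-out error diverge as the tree is allowed to grow deeper. A sketch with scikit-learn's `DecisionTreeRegressor` on synthetic data (the data and depth values are ours, for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))                  # synthetic data for illustration
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (2, 5, 10, None):                    # None lets the tree grow without limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))  # keeps shrinking with depth
    test_mse = mean_squared_error(y_te, model.predict(X_te))   # typically bottoms out, then worsens
    print(depth, round(train_mse, 3), round(test_mse, 3))
```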
Pruning
Pruning, literally “cutting branches,” is the practice of trimming branches of a decision tree to prevent the overfitting that arises from excessively subdividing cases as described above. Pruning methods are broadly classified into pre-pruning and post-pruning; both are sketched after the list below.
- Pre-pruning: When building the tree, impose constraints such as preventing the tree from growing beyond a certain depth or stopping splits when the number of samples in a node falls below a threshold.
- Post-pruning: First grow the tree fully, then identify nodes whose removal improves validation performance and prune those nodes.
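A minimal sketch of both styles with scikit-learn, on synthetic data made up for illustration: pre-pruning via constructor constraints, and post-pruning via cost-complexity pruning, with `ccp_alpha` selected on a validation set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 500) > 1).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: constrain growth while fitting.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0).fit(X_tr, y_tr)

# Post-pruning: compute the cost-complexity pruning path, then keep the
# pruned tree whose penalty (ccp_alpha) scores best on held-out data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(pre.score(X_val, y_val), best.score(X_val, y_val), best.get_depth())
```

Cost-complexity pruning is the post-pruning scheme associated with CART (Breiman et al., 2017); choosing `ccp_alpha` by validation score, as above, is just one common recipe.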
See also
Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. Chapman and Hall/CRC. https://doi.org/10.1201/9781315139470
