
What is Gradient Boosting?

Definition

Gradient boosting is a machine learning technique that improves performance by applying boosting to decision trees. The residuals of a decision tree’s predictions are treated as a new dependent variable on which another decision tree is trained; this process is repeated, and the final model is obtained by summing the models fitted to the residuals at each stage.
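To make the definition concrete, the following is a minimal sketch of that residual-fitting loop, assuming scikit-learn's DecisionTreeRegressor as the base learner; the synthetic data, tree depth, and number of stages are arbitrary choices for illustration, not part of the definition.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Minimal gradient boosting loop: each new tree is fit to the residuals of the
# current model, and the model is updated by adding the new tree's predictions.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

trees = []
prediction = np.zeros_like(y)          # f_1 = 0, so the first tree is fit to y itself
for t in range(50):
    residual = y - prediction          # r_t(x_k) = y_k - f_t(x_k)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(h)
    prediction += h.predict(X)         # f_{t+1} = f_t + h_t

print("training MSE:", np.mean((y - prediction) ** 2))
```

The final model is simply the sum of the stored trees: to predict at a new point, add up `h.predict(...)` over every tree in `trees`.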

Description

Gradient boosting is a classical machine learning method mainly used for regression problems, and it shines in situations where predictive performance is valued more than interpretability. Personally, among classical methods I consider it the technique closest in spirit to deep learning: just as deep learning stacks nonlinear activation functions to approximate a target function without worrying about its precise form, gradient boosting tackles the problem by combining decision trees whose individual forms need not be scrutinized. In practice it has achieved strong results in data science competitions, repeatedly demonstrating how broadly it applies.

Its popularity is another advantage: accessible implementations and enhancements of plain gradient boosting, such as XGBoost, LightGBM, and CatBoost, have been released, making it easy to use. Because it is accessible it gets used more, and because it gets used more it keeps producing good results, a virtuous cycle that has made it widely loved.
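As a hypothetical quick-start, the snippet below uses XGBoost's scikit-learn style wrapper; LightGBM's LGBMRegressor and CatBoost's CatBoostRegressor are used in much the same way, and the hyperparameter values here are only placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Fit an off-the-shelf gradient boosting model on synthetic regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("test MSE:", np.mean((y_test - model.predict(X_test)) ** 2))
```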

Why “gradient”

Treating residuals as a dependent variable may at first glance seem unrelated to gradient descent, which computes gradients via partial derivatives and moves parameters in that direction. Formally, however, the similarity can be shown. For a given dataset $\left\{ \left( x _{k} , y_{k} \right) \right\}_{k=1}^{n}$, the $k$-th residual $r_{t} \left( x_{k} \right)$ of the stage-$t$ model $f_{t}$ is as follows.
$$ r_{t} \left( x_{k} \right) = y_{k} - f_{t} \left( x_{k} \right) $$
The decision tree $h_{t}$ newly trained at stage $t$ to target $r_{t}$ can be viewed as a linear combination of indicator functions that, according to the node splits, predicts the mean $\bar{r}_{i}$ of the residuals of the data points belonging to the $i$-th region $R_{i}$.
$$ h_{t} (x) = \sum_{i} \bar{r}_{i} \mathbf{1}_{R_{i}} (x) $$
With this notation, the next model $f_{t+1}$ is the sum of the decision trees trained so far.
$$ f_{t+1} (x) = h_{1} (x) + h_{2} (x) + \cdots + h_{t} (x) $$
If $h_{t}$ can predict $r_{t}$ perfectly, in other words if $f_{t+1} \left( x_{k} \right) = y_{k}$ can be satisfied as desired, then $f_{t+1}$ satisfies the following.
$$ f_{t+1} \left( x_{k} \right) = f_{t} \left( x_{k} \right) + h_{t} \left( x_{k} \right) = y_{k} $$
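The piecewise-constant form of $h_{t}$ can be checked directly: a regression tree fit to the residuals predicts, within each region $R_{i}$, the mean of the residuals that fall into it. The sketch below assumes scikit-learn's DecisionTreeRegressor, and the data and the current model $f_{t}$ are arbitrary stand-ins.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Verify that each leaf of a tree fit to residuals predicts the mean residual
# of the data points in that leaf (the region R_i).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0])
f_t = np.full_like(y, y.mean())        # some current model f_t (here a constant)
residual = y - f_t                     # r_t

h_t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
leaves = h_t.apply(X)                  # which region R_i each point falls into
for i in np.unique(leaves):
    in_region = leaves == i
    leaf_value = h_t.predict(X[in_region])[0]   # constant prediction on R_i
    print(i, np.isclose(leaf_value, residual[in_region].mean()))
```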

For regression problems, the loss function $L$ is typically taken to be the MSE, the mean of the squared residuals.
$$ L = {\frac{ 1 }{ n }} \sum_{k=1}^{n} \left( y_{k} - f_{t} \left( x_{k} \right) \right)^2 $$
Writing $\nabla L$ for the sum of the partial derivatives of $L$ with respect to the predictions $f_{t} \left( x_{k} \right)$, and using the assumption that $h_{t}$ matches the residuals,
$$ \begin{align*} n \nabla L =& n \left[ {\frac{ \partial L }{ \partial f_{t} \left( x_{1} \right) }} + \cdots + {\frac{ \partial L }{ \partial f_{t} \left( x_{n} \right) }} \right] \\ =& \sum_{k=1}^{n} -2 \left( y_{k} - f_{t} \left( x_{k} \right) \right) \\ =& -2 \sum_{k=1}^{n} h_{t} \left( x_{k} \right) \end{align*} $$
A point to note is that, unlike ordinary gradient descent, where one differentiates with respect to the variable being optimized, here we differentiate with respect to the predicted values $f_{t} \left( x_{k} \right)$. This is natural because the current setting does not seek a single point in parameter space but an optimization over all data points; many texts show the formula without explaining this, which can cause confusion. Summing over the entire dataset, the update takes the following form.
$$ \begin{align*} & \sum_{k} f_{t+1} \left( x_{k} \right) = \sum_{k} f_{t} \left( x_{k} \right) + \sum_{k} h_{t} \left( x_{k} \right) \\ \implies & \sum_{k} f_{t+1} \left( x_{k} \right) = \sum_{k} f_{t} \left( x_{k} \right) - {\frac{ n }{ 2 }} \nabla L \\ \implies & f_{t+1} (x) \approx f_{t} (x) - {\frac{ 1 }{ 2 }} \nabla L \end{align*} $$
Here, the closer $\nabla L$ is to $0$, the smaller the update from $f_{t}$ to $f_{t+1}$, so $f_{t+1} \approx f_{t}$ and we are approaching the desired approximation.
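The relationship between the residuals and this gradient can be verified numerically: perturbing each prediction $f_{t} \left( x_{k} \right)$ and estimating $\partial L / \partial f_{t} \left( x_{k} \right)$ by central differences should recover $-\frac{2}{n} \left( y_{k} - f_{t} \left( x_{k} \right) \right)$, so the sum of the partial derivatives is $-\frac{2}{n}$ times the sum of the residuals. The data below are arbitrary.

```python
import numpy as np

# Numerically check dL/df_t(x_k) = -(2/n) * (y_k - f_t(x_k)) for the MSE loss.
rng = np.random.default_rng(2)
n = 5
y = rng.normal(size=n)
f = rng.normal(size=n)                 # current predictions f_t(x_k)

def loss(pred):
    return np.mean((y - pred) ** 2)

eps = 1e-6
numeric = np.array([
    (loss(f + eps * np.eye(n)[k]) - loss(f - eps * np.eye(n)[k])) / (2 * eps)
    for k in range(n)
])                                     # estimated partial derivatives
analytic = -2.0 / n * (y - f)          # -(2/n) * r_t(x_k)

print(np.allclose(numeric, analytic, atol=1e-6))               # partials match
print(np.isclose(numeric.sum(), -2.0 / n * (y - f).sum()))     # so does the sum
```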

Gradient descent: The following update rule, used to find a local minimum of the loss function $L$, is called gradient descent.
$$\mathbf{w}_{t+1} \gets \mathbf{w}_{t} - \alpha \nabla L$$

Finally, if we denote $\mathbf{w}_{t} = f_{t} (x)$, we can see that the essence of gradient boosting is the same as gradient descent. The learning rate $\alpha$ can be mapped to the weights of the decision trees as $f_{t+1} (x) = f_{t} (x) + \alpha h_{t} (x)$.
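As a sketch of this correspondence, the earlier residual-fitting loop can be rerun with a learning rate $\alpha$ scaling each tree, so that the update reads $f_{t+1} (x) = f_{t} (x) + \alpha h_{t} (x)$; the value of $\alpha$ and the other settings are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Residual-fitting loop with a learning rate, mirroring gradient descent's alpha.
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

alpha = 0.1                            # learning rate applied to each tree
prediction = np.zeros_like(y)
for t in range(200):
    residual = y - prediction          # residuals point along the negative gradient
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += alpha * h.predict(X) # f_{t+1} = f_t + alpha * h_t

print("training MSE:", np.mean((y - prediction) ** 2))
```

Smaller values of $\alpha$ take smaller steps per stage and typically need more trees, just as a small learning rate slows but stabilizes gradient descent.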

Why “boosting”

Because gradient boosting sequentially uses the residuals as the dependent variable, each new model can be seen as compensating for what the previous models failed to capture. This qualifies as boosting even though there is no explicit step of computing weights or other separate procedure.

See also