
Bagging and Boosting in Data Science

Terms

  1. (mainly in data science) Bootstrap refers to a technique for increasing the number of datasets by drawing samples, with replacement, from an existing dataset (resampling with replacement); see the sketch after this list.
  2. Bagging is short for bootstrap aggregating, a method that builds multiple models and forms a final model by aggregating them.
  3. Boosting denotes a method that iteratively builds models while assigning larger weights to data points that were poorly predicted, thereby producing a final model that compensates previous weaknesses.
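
For readers who think in code, the following is a minimal sketch of resampling with replacement using NumPy; the array `data` and the seed are illustrative assumptions, not tied to any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# An illustrative dataset of N observations (the values are arbitrary).
data = np.array([2.3, 1.7, 3.1, 0.9, 2.8, 1.2, 3.5, 2.0])
N = len(data)

# One bootstrap sample: draw N indices with replacement, so some points
# appear multiple times and others not at all.
indices = rng.integers(0, N, size=N)
bootstrap_sample = data[indices]

print("original :", data)
print("resampled:", bootstrap_sample)
```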

Explanation

Intuitively, bagging is parallel while boosting is sequential.

There is no shortage of online materials explaining bagging and boosting with diagrams or examples, but for those who prefer formulae, the paper referenced in this post may be more appealing¹. Let $T$ be the total number of iterations in which “learning” occurs; at each iteration, draw a sub-dataset of $n$ points by sampling with replacement from the $N$ data points, and let $f^{(t)}$ denote the model trained on it to solve the regression or classification problem. Because each model in $\left\{ f^{(t)} \right\}_{t=1}^{T}$ is trained on less data than the full dataset, it may be treated as a kind of weak model.

Bagging

At every iteration $t = 1 , \cdots , T$, sample $n$ data points with replacement and train a model on the sample to obtain $f^{(1)} , \cdots , f^{(T)}$. These models are independent of one another, so they can be trained concurrently on multiple devices, and each model’s unused portion of the data (its out-of-bag sample) can serve as test data, which makes validation straightforward.
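
As a rough illustration of this procedure, the sketch below trains $T$ depth-limited trees on independent bootstrap samples and averages them; the toy sine data, the use of scikit-learn’s `DecisionTreeRegressor` as the weak model, and the parameter values are assumptions made for brevity, not part of the original description.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=0)

# Toy regression data: N points with one feature (illustrative).
N, T, n = 200, 25, 200
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=N)

models = []
for t in range(T):
    # t-th bootstrap sample: n indices drawn with replacement.
    idx = rng.integers(0, N, size=n)
    f_t = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    models.append(f_t)
    # Points absent from idx form this model's out-of-bag sample
    # and can be used as held-out test data.

# Bagged prediction for regression: average the T weak models.
def F(X_new):
    return np.mean([f.predict(X_new) for f in models], axis=0)

print(F(np.array([[0.0], [1.5]])))
```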

Boosting

In the $t$-th iteration, the $N$ data points are sampled with probability proportional to the weight vector $w^{(t)} = (w^{(t)}_{1}, \cdots , w^{(t)}_{N})$. Initially, all weights are equal, $w_{k}^{(1)} = 1 / N$; during training, the weights of data points with large errors or misclassifications are increased so that they are sampled more frequently in the next iteration. The models $f^{(t)}$ trained over successive iterations thereby specialize in directions that progressively compensate for prior weaknesses.
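
The loop below is a minimal sketch of this weight-updating idea: resample in proportion to $w^{(t)}$, fit a weak model, then up-weight the points it gets wrong. The specific rule of doubling the weights of misclassified points, as well as the toy data and the stump classifier, are illustrative choices, not a prescription from the referenced paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(seed=0)

# Toy binary classification data (illustrative).
N, T = 300, 10
X = rng.normal(size=(N, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w = np.full(N, 1.0 / N)  # w^(1): start with uniform weights
models = []

for t in range(T):
    # Sample N points with probability proportional to the current weights.
    idx = rng.choice(N, size=N, replace=True, p=w)
    f_t = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
    models.append(f_t)

    # Up-weight the points this model misclassifies (here: double them),
    # then renormalize so the weights remain a probability vector.
    miss = f_t.predict(X) != y
    w = w * np.where(miss, 2.0, 1.0)
    w = w / w.sum()
```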

Final model

The final model $F$ is constructed differently for regression and classification problems. For regression, it is expressed as returning the mean of the outputs: $$ F(x) = {\frac{ 1 }{ T }} \left( f^{(1)}(x) + \cdots + f^{(T)}(x) \right) $$

For classification, it returns the majority vote among the outputs, where $\operatorname{mode}$ denotes the mode: $$ F(x) = \operatorname{mode} \left\{ f^{(1)}(x), \cdots , f^{(T)}(x) \right\} $$
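
Given trained models $f^{(1)} , \cdots , f^{(T)}$, both aggregation rules are short to write down; the sketch below assumes `models` is a list of fitted weak models such as the ones built in the snippets above, with integer class labels in the classification case.

```python
import numpy as np

# `models` is assumed to be a list of fitted weak models, as above.

def F_regression(models, X_new):
    # Regression: mean of the T individual predictions.
    return np.mean([f.predict(X_new) for f in models], axis=0)

def F_classification(models, X_new):
    # Classification: majority vote (mode) over the T individual predictions.
    preds = np.stack([f.predict(X_new).astype(int) for f in models])  # shape (T, m)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```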

Note that these implementations are canonical in most situations but are not the only ways to interpret bagging and boosting. Conceptually, as stated above, the key distinction is whether the approach is parallel or sequential.


  1. Quinlan, J. R. (1996, August). Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1 (pp. 725–730). ↩︎