What is LASSO Regression?
Definition
$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$ Given $n$ data points and $p < n$, a linear multiple regression model can be represented by a design matrix as shown above and written succinctly as $Y = X \beta + \varepsilon$. The following optimization problem is called BPDN (Basis Pursuit Denoising), and solving BPDN is referred to as LASSO or LASSO Regression. $$ \argmin_{\beta} \left( {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} \right) $$ Here, $\lambda \ge 0$ is called the Tuning Parameter.
- $\left\| \cdot \right\|_{2}$ is the Euclidean norm, and $\left\| \cdot \right\|_{1}$ is the $l_{1}$ norm, i.e. the sum of the absolute values of the components.
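To make the definition concrete, here is a minimal numpy sketch that evaluates the BPDN objective for a given $\beta$ and $\lambda$; the data and the variable names (`X`, `Y`, `beta`, `lam`) are illustrative assumptions, not tied to any particular dataset.

```python
import numpy as np

def bpdn_objective(Y, X, beta, lam):
    """(1/2)||Y - X beta||_2^2 + lam * ||beta||_1 -- the BPDN objective from the definition."""
    residual = Y - X @ beta
    return 0.5 * residual @ residual + lam * np.sum(np.abs(beta))

# Illustrative data: n = 50 observations, p = 3 predictors plus an intercept column.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, 0.0, -0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

print(bpdn_objective(Y, X, beta_true, lam=1.0))
```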
Explanation
LASSO stands for Least Absolute Shrinkage and Selection Operator: as the name suggests, it shrinks the absolute values of the regression coefficients and, in doing so, selects only the coefficients that are actually needed.
In machine learning terms, BPDN can also be viewed as $l_{1}$ regularization, which adds the penalty $\lambda \left\| \beta \right\|_{1}$ to the ordinary least squares problem.
Why Use It?
It’s beneficial to read the sparse regression document first:
- From a statistical perspective: It is used to find models that are easy to interpret. Roughly speaking, if a regression coefficient is not large relative to the scale of the data, it is judged statistically insignificant; in other words, a coefficient that is nearly $0$ may not be needed to explain the data. The interpretation is not exactly the same in LASSO regression, but the ultimate goal is still to identify and deal with ‘small regression coefficients’.
- From a machine learning perspective: It is used to prevent overfitting. To describe the given data well, one might add very complex terms or collect additional variables so that the model covers even the most special cases. However, a model fitted that meticulously may perform excellently on the training data yet poorly in practice. The penalty does reduce how closely the model fits the data, but in exchange it avoids the risk of overfitting that comes from estimating tiny regression coefficients for countless variables.
This is just a difference in perspective, but if read carefully, it’s essentially the same statement.
Tuning Parameter $\lambda$
- This explanation is about ridge regression, but the context of handling the tuning parameter applies to LASSO regression as well.
As the tuning parameter $\lambda$ introduced in the definition increases, the penalty $\lambda \left\| \beta \right\|_{1}$ grows. If $\lambda$ is too small, the problem is no different from ordinary regression analysis; if it is too large, $\beta = \mathbf{0}$ becomes the best choice regardless of how well the model explains the data. As an extreme and intuitive example, if the data values are on a scale of about 0~10 but the penalty is given a huge weight $\lambda = 10^{6}$, minimizing $\left\| \beta \right\|_{1}$ dominates and the model can no longer do its primary job of explaining the data. The point is that an appropriate $\lambda$ must be chosen; the formula alone says nothing about whether a large or small $\lambda$ is good or bad.
Therefore, without any specific intuition or criterion for the given data, one common method is to vary $\lambda$ and select the value that minimizes the cross-validation error. The figure above plots the cross-validation error of such an analysis as a function of $\lambda$.1 The graph attains its minimum around $5 \times 10^{-1}$, marked with a vertical dotted line, and unless there is a particular reason to do otherwise, it is reasonable to use $\lambda$ at that value. A hedged sketch of such a search is given below.
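As a rough illustration of this kind of search, the following sketch uses scikit-learn's `LassoCV` to pick the penalty by cross-validation on synthetic data; note that scikit-learn parameterizes the penalty as `alpha` with the squared-error term scaled by $1/(2n)$, so its `alpha` corresponds to $\lambda / n$ in the notation above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only a few of the p predictors actually matter.
rng = np.random.default_rng(1)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# 5-fold cross-validation over an automatically chosen grid of penalties.
model = LassoCV(cv=5).fit(X, y)
print("best alpha (~ lambda/n):", model.alpha_)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```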
Differences from Ridge Regression
Historically, ridge regression was introduced in 1970 as a method that, in the bias-variance trade-off, sacrifices a bit of unbiasedness (accepting a small bias) in exchange for reduced variance, thereby improving the efficiency of parameter estimation.2 LASSO was first introduced in 1986 in geophysics and then reintroduced in 1996, when it was given the name LASSO.3
- Objective function of ridge regression: $$\argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{2}^{2} \right)$$
- Objective function of LASSO regression: $$\argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} \right)$$
As can be seen, the objective functions of ridge regression and LASSO regression look very similar, which leads to frequent comparisons. Frankly, though, the similarity is only in the formal appearance of the objective functions, and weighing their pros and cons head-to-head can be an oversimplification. The differences between ridge and LASSO are commonly explained in terms of whether the penalty term is $l_{1}$ or $l_{2}$, whether the optimum is differentiable and has a simple closed form, and whether coefficients can actually be driven to exactly $0$. More detailed treatments include Python example code and empirical notes on which method tends to be better under which circumstances.
… However, such explanations are abundantly available in books, on Wikipedia, in blog posts, and so on; posts that neatly summarize these well-known points are easy to find by searching ‘Ridge vs LASSO’ on Google. In this post, we aim to dig a bit deeper, just a little bit more, than those.
- The following content explains how LASSO regression differs from ridge regression from the perspective of LASSO regression. To see how ridge regression views LASSO regression, check the relevant post.
Generally, it’s often explained that LASSO regression is more expensive or computationally complex than ridge regression. While not entirely incorrect, if the goal is sparse regression, the additional cost might not be an issue.
In the above figure, $\hat{\beta}$ is the optimal solution of the original least squares problem, the left side shows the constraint region of LASSO regression ($l_{1}$), and the right side that of ridge regression ($l_{2}$).4 This geometric explanation is so common that searching ‘Ridge vs LASSO’ on Google yields countless versions of the same image, and it is indeed an effective explanation.
- (Left) LASSO: Saying that the sum of the absolute values of the regression coefficients $\beta_{1}$ and $\beta_{2}$ equals some value $r$ means the point lies on the square $\left| \beta_{1} \right| + \left| \beta_{2} \right| = r$ with side length $\sqrt{2} r$. The optimal solution of BPDN must be a point where a contour of $\left\| Y - X \beta \right\|_{2}$ is as low as possible while still touching this square, and as the figure shows, that point can land exactly on an axis. A solution on an axis means at least one coordinate is $0$, which is why LASSO can make certain regression coefficients exactly $0$.
- (Right) Ridge: Saying that the sum of the squares of the regression coefficients $\beta_{1}$ and $\beta_{2}$ equals some value $r$ means the point lies on the circle $\beta_{1}^{2} + \beta_{2}^{2} = r$ with radius $\sqrt{r}$. Just as with LASSO, for a given $r$ the intersection of the circle with the contours around $\hat{\beta}$ is almost certainly not on an axis. (A small plotting sketch that reproduces this kind of picture follows below.)
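For readers who want to redraw this kind of picture themselves, here is a minimal matplotlib sketch; the two-variable least squares problem, the point playing the role of $\hat{\beta}$, and the budget `r` are all arbitrary illustrative choices, not the setup of the figure cited above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Arbitrary 2-variable least squares problem, chosen only for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
beta_hat = np.array([1.5, 0.6])              # plays the role of the unpenalized optimum
y = X @ beta_hat

# Contours of ||y - X beta||_2 over a grid of (beta1, beta2).
b1, b2 = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 3, 200))
resid = y[:, None, None] - X[:, 0, None, None] * b1 - X[:, 1, None, None] * b2
rss = np.sqrt((resid ** 2).sum(axis=0))

t = np.linspace(0, 2 * np.pi, 400)
r = 1.0                                      # constraint "budget"

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)
titles = [r"LASSO: $|\beta_1|+|\beta_2| = r$", r"Ridge: $\beta_1^2+\beta_2^2 = r$"]
for ax, title in zip(axes, titles):
    ax.contour(b1, b2, rss, levels=15, cmap="Greys")
    ax.plot(*beta_hat, "ro")                 # the unpenalized optimum
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_title(title)
    ax.set_aspect("equal")

# l1 "diamond" on the left, l2 circle on the right.
axes[0].plot(r * np.sign(np.cos(t)) * np.cos(t) ** 2,
             r * np.sign(np.sin(t)) * np.sin(t) ** 2, "b")
axes[1].plot(np.sqrt(r) * np.cos(t), np.sqrt(r) * np.sin(t), "b")
plt.show()
```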
Of course, ridge regression has its own merits, but in the context of sparse regression it is odd to put LASSO side by side with a method that cannot make any coefficient exactly $0$.5 Arguments about which is better for certain kinds of data, for certain choices of $\lambda$, in terms of overfitting, computational speed, or simplicity… can all be beside the point. Ridge regression simply cannot do what LASSO regression can, and no amount of tuning bridges that difference. The comparison between LASSO and ridge should not come down to minor pros and cons.
- Looking only at the outcome, $\beta_{k} = 0$ and $\beta_{k} \approx 0$ may not seem very different, but in terms of the process there is a big difference: if $\beta_{k} = 0$, the independent variable $X_{k}$ can be dropped from the start. If, say, for about 1 million variables ridge regression gives $\beta_{k} \approx 0$ while LASSO regression gives exactly $\beta_{k} = 0$, then even with similar model performance, ridge regression needs roughly 1 million extra multiplications every time a data point is fitted. (A small sketch of this contrast follows below.)
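The following sketch, on synthetic data in which only a handful of predictors are truly relevant, simply counts how many coefficients each method sets exactly to zero; the `alpha` values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 predictors, only 5 of which actually enter the model.
rng = np.random.default_rng(3)
n, p = 300, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = rng.uniform(1, 3, size=5)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))   # typically 0
print("lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))   # typically most of them
```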
Meanwhile, LASSO actually has a known closed form for the optimal solution under a specific condition, and that closed form involves soft thresholding. In other words, the formula itself contains the part that sets $\beta_{k}$ to $0$, and to understand LASSO regression without relying on pictures or intuition alone, one needs the mathematical discussion below.
Formulas
Optimal Solution 6
$$ L \left( \beta \right) = {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} $$ When $\lambda$ is given as a constant, let’s express the objective function $L$ of LASSO regression as shown above. Generally, LASSO regression does not have a closed form for the optimal solution, but assuming all columns of $X$ are orthogonal to each other, the $k$-th component $\left( \hat{\beta} \right)_{k}$ of $\hat{\beta} = \argmin_{\beta} L \left( \beta \right)$ is as follows. $$ \begin{align*} \left( \hat{\beta} \right)_{k} =& {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \eta_{\lambda} \left( X^{T} Y \right)_{k} \\ = & {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \begin{cases} \left( X^{T} Y \right)_{k} + \lambda & , \text{if } \left( X^{T} Y \right)_{k} < - \lambda \\ 0 & , \text{if } \left( X^{T} Y \right)_{k} \in [-\lambda, \lambda] \\ \left( X^{T} Y \right)_{k} - \lambda & , \text{if } \left( X^{T} Y \right)_{k} > \lambda \end{cases} \end{align*} $$ Here, $A^{T}$ is the transpose matrix of $A$, $A^{-1}$ is the inverse matrix of $A$, and $\eta_{\lambda}$ is soft thresholding.
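As a sanity check of this closed form, the sketch below builds a design matrix with orthonormal columns via a QR decomposition, applies the soft-thresholding formula, and compares the result with scikit-learn's `Lasso`; per its documentation scikit-learn minimizes $\frac{1}{2n}\|Y - X\beta\|_{2}^{2} + \alpha\|\beta\|_{1}$, so $\alpha = \lambda / n$ is used here to match the objective above. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    """eta_lambda(z): shrink z toward 0 by lam, and set it to 0 inside [-lam, lam]."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(4)
n, p, lam = 60, 5, 2.0

# Design matrix with orthonormal columns, so that X^T X is the identity (in particular diagonal).
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([4.0, -3.0, 0.0, 0.0, 1.0])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form: beta_k = eta_lambda((X^T Y)_k) / (X^T X)_kk, and here (X^T X)_kk = 1.
beta_closed = soft_threshold(X.T @ Y, lam) / np.diag(X.T @ X)

# Numerical solution of the same objective (alpha = lam/n matches the scalings).
beta_numeric = Lasso(alpha=lam / n, fit_intercept=False).fit(X, Y).coef_

print(np.round(beta_closed, 4))
print(np.round(beta_numeric, 4))   # should agree up to the solver's tolerance
```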
Derivation 7
Gradient of Vectors and Matrices: $$ \frac{ \partial }{ \partial \mathbf{w} }\left( \mathbf{w}^{T}\mathbf{R}\mathbf{w} \right)= \left( \mathbf{R} + \mathbf{R}^{T} \right) \mathbf{w} $$
Gradient of Residual Sum of Squares: $$ f \left( \mathbf{s} \right) := \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right) $$ For a vector $\mathbf{y} \in \mathbb{R}^{n}$ and matrices $X \in \mathbb{R}^{n \times p}$, $R \in \mathbb{R}^{n \times n}$ that are not dependent on $\mathbf{s}$, the following holds true. $$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - X^{T} \left( R + R^{T} \right) \left( \mathbf{y} - X \mathbf{s} \right) $$
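As a side check before using these identities, the following sketch compares the stated formula for $\partial f / \partial \mathbf{s}$ with a central finite-difference approximation on random inputs; all names and sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 7, 4
y = rng.normal(size=n)
X = rng.normal(size=(n, p))
R = rng.normal(size=(n, n))          # deliberately not symmetric
s = rng.normal(size=p)

def f(s):
    r = y - X @ s
    return r @ R @ r

# Formula from the text: df/ds = -X^T (R + R^T) (y - X s)
grad_formula = -X.T @ (R + R.T) @ (y - X @ s)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (f(s + eps * e) - f(s - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_formula, grad_fd, atol=1e-4))   # expected: True
```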
Applying the above formulas with $R = I$ gives $$ \begin{align*} {{ \partial } \over { \partial \beta }} L \left( \beta \right) =& {{ \partial } \over { \partial \beta }} {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + {{ \partial } \over { \partial \beta }} \lambda \left\| \beta \right\|_{1} \\ =& {{ \partial } \over { \partial \beta }} {{ 1 } \over { 2 }} \left( Y - X \beta \right)^{T} \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ =& - {{ 1 } \over { 2 }} X^{T} \left( I + I^{T} \right) \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ =& - X^{T} \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ =& - X^{T} Y + X^{T} X \beta + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \end{align*} $$ Since $\beta = \hat{\beta}$ must satisfy ${{ \partial } \over { \partial \beta }} L = 0$, we obtain the following. $$ X^{T} X \hat{\beta} = X^{T} Y - \lambda {{ \partial } \over { \partial \hat{\beta} }} \left\| \hat{\beta} \right\|_{1} $$ Now write $\hat{\beta}_{k} := \left( \hat{\beta} \right)_{k}$ for $k = 0, 1, \cdots, p$. Denoting the $i$-th row of a matrix $A$ by $\left( A \right)_{i \cdot}$, we get the following. $$ \begin{align*} & X^{T} X \hat{\beta} = X^{T} Y - \lambda {{ \partial } \over { \partial \hat{\beta} }} \left\| \hat{\beta} \right\|_{1} \\ \implies & \left( X^{T} X \right)_{i \cdot} \hat{\beta} = \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \sum_{j=0}^{p} \left| \hat{\beta}_{j} \right| \\ \implies & \sum_{j=0}^{p} \left( X^{T} X \right)_{i j} \hat{\beta}_{j} = \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \left| \hat{\beta}_{i} \right| \\ \implies & \hat{\beta}_{i} = {{ 1 } \over { \left( X^{T} X \right)_{ii} }} \left[ \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \left| \hat{\beta}_{i} \right| - \sum_{j \ne i} \left( X^{T} X \right)_{i j} \hat{\beta}_{j} \right] \end{align*} $$ Since each $\hat{\beta}_{i}$ depends on the other $\hat{\beta}_{j}$, this is not yet a closed form; however, whenever $\left( X^{T} X \right)_{i j} = 0$ the term $\hat{\beta}_{j}$ can be ignored. Adding the assumption that the columns of $X$ are orthogonal to each other makes $X^{T} X$ a diagonal matrix, so all the cross terms vanish.
$\hat{\beta}_{k}$ can be greater than, less than, or equal to $0$. If $\hat{\beta}_{k} > 0$, then $$ \begin{align*} & \hat{\beta}_{k} > 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} - \lambda {{ \partial } \over { \partial \hat{\beta}_{k} }} \left| \hat{\beta}_{k} \right| \right] > 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} - \lambda \right] > 0 \end{align*} $$ which implies not only that $\hat{\beta}_{k} > 0$ but also that $\left( X^{T} Y \right)_{k} > \lambda$. Similarly, if $\hat{\beta}_{k} < 0$, then $$ \begin{align*} & \hat{\beta}_{k} < 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} - \lambda (-1) \right] < 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} + \lambda \right] < 0 \end{align*} $$ which implies that $\left( X^{T} Y \right)_{k} < - \lambda$. Otherwise $\hat{\beta}_{k} = 0$, and the three cases can be summarized with soft thresholding as follows. $$ \begin{align*} \hat{\beta}_{k} =& {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \begin{cases} \left( X^{T} Y \right)_{k} + \lambda & , \text{if } \left( X^{T} Y \right)_{k} < - \lambda \\ 0 & , \text{if } \left( X^{T} Y \right)_{k} \in [-\lambda, \lambda] \\ \left( X^{T} Y \right)_{k} - \lambda & , \text{if } \left( X^{T} Y \right)_{k} > \lambda \end{cases} \\ = & {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \eta_{\lambda} \left( X^{T} Y \right)_{k} \end{align*} $$
■
Algorithm
$$ \hat{\beta}_{i} = {{ 1 } \over { \left( X^{T} X \right)_{ii} }} \left[ \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \left| \hat{\beta}_{i} \right| - \sum_{j \ne i} \left( X^{T} X \right)_{i j} \hat{\beta}_{j} \right] $$ As already seen in the derivation, having only implicit equations does not mean $\hat{\beta}_{j}$ cannot be found: an approximate solution can be obtained with an iterative algorithm such as gradient descent.8 Unlike many optimization problems, with BPDN we at least have equations against which a candidate solution can be checked, which makes the situation somewhat favorable. Moreover, even if an iterative method is used in the end, when the columns of $X$ are reasonably close to orthogonal, the solution given by the formula above can serve as a starting point. Problems caused by dependence among the variables can hardly be blamed on LASSO itself. (A minimal coordinate-wise sketch follows below.)
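As one concrete, hedged way to turn this fixed-point equation into an algorithm, the sketch below cycles through the coordinates and applies the soft-thresholding update repeatedly; this coordinate-wise scheme is a common choice for LASSO rather than anything specific claimed by the cited source, and the data, tolerance, and iteration cap are arbitrary.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_updates(X, Y, lam, n_iter=200, tol=1e-8):
    """Cyclically apply beta_i <- soft_threshold((X^T Y)_i - sum_{j != i}(X^T X)_ij beta_j, lam) / (X^T X)_ii."""
    XtX = X.T @ X
    XtY = X.T @ Y
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta_old = beta.copy()
        for i in range(len(beta)):
            rho = XtY[i] - XtX[i] @ beta + XtX[i, i] * beta[i]   # exclude the j = i term
            beta[i] = soft_threshold(rho, lam) / XtX[i, i]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

# Illustrative data with correlated (non-orthogonal) columns.
rng = np.random.default_rng(6)
n, p = 100, 8
X = rng.normal(size=(n, p)) + 0.3 * rng.normal(size=(n, 1))     # shared component -> correlation
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.0, 0.0, 0.0])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

print(np.round(lasso_coordinate_updates(X, Y, lam=5.0), 3))
```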
See Also
James. (2013). An Introduction to Statistical Learning with Applications in R. p. 228.
James. (2013). An Introduction to Statistical Learning with Applications in R. p. 222.
Muneyuki Kaneshiro, Yusuke Nomura. (2018). Blue Rock: Volume 1.
https://machinelearningcompass.com/machine_learning_models/lasso_regression/