

What is LASSO Regression?

Definition

$$
\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix}
$$

Given $n$ data points and $p < n$, a linear multiple regression model can be represented by a design matrix as shown above and expressed succinctly as $Y = X \beta + \varepsilon$. Here, the following optimization problem is called BPDN, and solving BPDN is referred to as LASSO or LASSO Regression.

$$
\argmin_{\beta} \left( {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} \right)
$$

In this context, $\lambda \ge 0$ is known as the Tuning Parameter.


  • $\left\| \cdot \right\|_{2}$ represents the Euclidean norm, and $\left\| \cdot \right\|_{1}$ denotes the sum of absolute values.
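
To make the definition concrete, here is a minimal numpy sketch that simply evaluates the BPDN objective on tiny synthetic data. The function name `bpdn_objective` and the toy data are illustrative assumptions, not part of any library.

```python
import numpy as np

def bpdn_objective(Y, X, beta, lam):
    """BPDN / LASSO objective: (1/2)||Y - X beta||_2^2 + lam * ||beta||_1."""
    residual = Y - X @ beta
    return 0.5 * np.dot(residual, residual) + lam * np.sum(np.abs(beta))

# Tiny synthetic example: n = 5 observations, p = 2 predictors plus an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, 0.0])                        # a sparse coefficient vector
Y = X @ beta_true + 0.1 * rng.normal(size=5)

print(bpdn_objective(Y, X, beta_true, lam=0.5))
```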

Explanation

LASSO (Least Absolute Shrinkage and Selection Operator), as the name suggests, shrinks the absolute values of the regression coefficients in a regression problem to the minimum necessary, thereby selecting only the variables that matter.

In machine learning, BPDN can also be seen as $l_{1}$ regularization, which adds $\lambda \left\| \beta \right\|_{1}$ to the ordinary least squares problem.

Why Use It?

It’s beneficial to read the sparse regression document first:

  • From a statistical perspective: It is used to find models that are easy to interpret. Fundamentally, if a regression coefficient is not significantly large compared to the scale of the data, it is considered statistically insignificant; in other words, if it is almost $0$, it may not be needed to explain the data. This interpretation does not carry over to LASSO regression exactly, but the ultimate goal is still to identify and deal with 'small regression coefficients'.
  • From a machine learning perspective: It is used to prevent overfitting. To describe the given data well, a model can be made to cover even the most exceptional cases by adding very complex terms or collecting additional data. However, a model fitted that meticulously may perform excellently on training data yet poorly in practical tests. Even though the penalty reduces how well the model explains the training data, it is a way to avoid the overfitting risk that comes from estimating minor regression coefficients for countless variables.

This is just a difference in perspective; read carefully, the two are essentially the same statement.

Tuning Parameter $\lambda$

  • This explanation is about ridge regression, but the context of handling the tuning parameter applies to LASSO regression as well.

The tuning parameter $\lambda$ introduced in the definition increases the weight of the penalty $\lambda \left\| \beta \right\|_{1}$ as it grows. If it is too small, the problem is no different from ordinary regression analysis; if it is too large, $\beta = \mathbf{0}$ becomes the best choice regardless of whether it explains the data or not. As an extreme and intuitive example, if the scale of the data values is around 0~10 but a huge weight $\lambda = 10^{6}$ is given to the penalty, the focus on minimizing $\left\| \beta \right\|_{1}$ prevents the model from performing its primary function, namely explaining the data well. The point is that an appropriate $\lambda$ must be chosen; just by looking at the formula, there is nothing inherently good or bad about $\lambda$ being large or small.

Therefore, without any specific intuition or criteria for the given data, one approach is to vary $\lambda$ and select the value that minimizes the cross-validation error. The figure above plots the cross-validated error of an analysis as a function of $\lambda$.[1] The curve attains its minimum around $5 \times 10^{-1}$, marked with a vertical dotted line, and unless there is a particular reason to do otherwise, it is reasonable to use $\lambda$ at that value.
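
As a rough sketch of this procedure, scikit-learn's `LassoCV` can search a grid of penalty strengths by cross-validation. Note that scikit-learn scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\lambda$ above, although it plays the same role; the synthetic data below are only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data with only a few informative predictors.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# LassoCV tries a grid of penalty strengths and keeps the one with the
# smallest cross-validation error.
model = LassoCV(cv=5, random_state=0).fit(X, y)

print("selected alpha:", model.alpha_)
print("number of nonzero coefficients:", np.sum(model.coef_ != 0))
```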

Differences from Ridge Regression

Historically, ridge regression was introduced in 1970 as a method that improves the efficiency of parameter estimation by sacrificing unbiasedness, accepting a little bias in exchange for lower variance in the bias-variance trade-off.[2] LASSO was first introduced in 1986 in geophysics and then reintroduced in 1996, when it was given the name LASSO.[3]

  • Objective function of ridge regression: $\argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{2}^{2} \right)$
  • Objective function of LASSO regression: $\argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} \right)$

As seen above, the objective functions of ridge regression and LASSO regression are very similar, which leads to frequent comparisons; but frankly, the only real similarity is the formal appearance of the objective functions, and comparing their pros and cons head-to-head can be an oversimplification. Commonly, the differences between ridge and LASSO are explained in terms of whether the penalty term is $l_{1}$ or $l_{2}$, whether the optimal solution is differentiable and can be written in a simple closed form, or whether certain coefficients can actually be reduced to exactly $0$. More detailed treatments include Python example code and empirical remarks about which method is generally superior and under what circumstances the other may be better.

… However, such explanations are abundantly available in books, Wikipedia, blogs, and so on; posts that neatly summarize these well-known points are easy to find by searching 'Ridge vs LASSO' on Google. In this post, we aim to dig a little deeper than they do.


  • The following content explains how LASSO regression differs from ridge regression from the perspective of LASSO regression. To see how ridge regression views LASSO regression, check the relevant post.

Generally, it’s often explained that LASSO regression is more expensive or computationally complex than ridge regression. While not entirely incorrect, if the goal is sparse regression, the additional cost might not be an issue.

In the figure above, $\hat{\beta}$ is the optimal solution of the original least squares problem; the left side shows the solution space of LASSO regression ($l_{1}$) and the right side that of ridge regression ($l_{2}$).[4] This geometric explanation is so prevalent that searching 'Ridge vs LASSO' on Google yields countless versions of the same image, which shows how effective it is.

  • (Left) LASSO: Saying that the sum of the absolute values of the regression coefficients $\beta_{1}$ and $\beta_{2}$ equals some value $r$ means the point lies on the square $\left| \beta_{1} \right| + \left| \beta_{2} \right| = r$ with side length $\sqrt{2}\, r$. The optimal solution of BPDN must be a point where a contour of $\left\| Y - X \beta \right\|_{2}$ is as low as possible while still touching the square, and as shown in the figure, that point can land exactly on an axis. A solution on an axis means at least one of the coordinates is $0$, which is why LASSO can make certain regression coefficients exactly $0$.
  • (Right) Ridge: Saying that the sum of the squares of the regression coefficients $\beta_{1}$ and $\beta_{2}$ equals some value $r$ means the point lies on the circle $\beta_{1}^{2} + \beta_{2}^{2} = r$ with radius $\sqrt{r}$. As with LASSO, at a given level of $r$, the intersection of the contour centered at $\hat{\beta}$ and the circle is almost certainly not on an axis. (A small plotting sketch of this picture follows the list.)
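
The picture described in the list above can be roughly reproduced with matplotlib; the center $\hat{\beta}$ and the matrix playing the role of $X^{T} X$ below are arbitrary choices made only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Contours of ||Y - X beta||_2 are ellipses centered at the least squares
# solution; the center and the matrix standing in for X^T X are arbitrary.
beta_hat = np.array([1.5, 1.0])
A = np.array([[2.0, 0.6], [0.6, 1.0]])

b1, b2 = np.meshgrid(np.linspace(-2, 3, 400), np.linspace(-2, 3, 400))
diff = np.stack([b1 - beta_hat[0], b2 - beta_hat[1]])
rss = (diff * np.tensordot(A, diff, axes=1)).sum(axis=0)  # quadratic form around beta_hat

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, title in zip(axes, ["LASSO: |b1| + |b2| = r", "Ridge: b1^2 + b2^2 = r"]):
    ax.contour(b1, b2, rss, levels=10, colors="gray", linewidths=0.5)
    ax.axhline(0, color="black", linewidth=0.5)
    ax.axvline(0, color="black", linewidth=0.5)
    ax.set_title(title)
    ax.set_aspect("equal")

r = 1.0
axes[0].contour(b1, b2, np.abs(b1) + np.abs(b2), levels=[r], colors="tab:blue")  # l1 diamond
axes[1].contour(b1, b2, b1**2 + b2**2, levels=[r], colors="tab:orange")          # l2 circle
plt.show()
```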

Of course, ridge regression has its merits, but in the context of sparse regression, comparing LASSO to a method that cannot make any coefficient exactly $0$ is odd.[5] For certain kinds of data, depending on how $\lambda$ is chosen, considerations of overfitting, computational speed, simplicity… may all be beside the point. Ridge regression simply cannot do what LASSO regression can, and there is fundamentally no way to bridge that gap. The comparison between LASSO and ridge should not come down to minor differences or a list of pros and cons.

  • Looking only at the outcome, $\beta_{k} = 0$ and $\beta_{k} \approx 0$ might not seem very different, but looking at the process, the fact that the independent variable $X_{k}$ can be excluded from the start when $\beta_{k} = 0$ makes a significant difference. If, for instance, ridge regression gives $\beta_{k} \approx 0$ but LASSO regression gives exactly $\beta_{k} = 0$ for about 1 million variables, then even if the models perform similarly, ridge regression requires about 1 million extra multiplications every time a data point is fitted. (The sketch below shows this sparsity difference on synthetic data.)
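
As a quick illustration of that difference, the following sketch fits `Lasso` and `Ridge` from scikit-learn on the same synthetic data and counts exactly-zero coefficients; the data and penalty strength are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Many predictors, only a handful of which actually matter.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# LASSO sets most coefficients exactly to zero; ridge only shrinks them.
print("LASSO zeros:", np.sum(lasso.coef_ == 0), "of", lasso.coef_.size)
print("Ridge zeros:", np.sum(ridge.coef_ == 0), "of", ridge.coef_.size)
```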

Meanwhile, LASSO actually has a known closed form for the optimal solution under specific conditions, and that optimal solution uses soft thresholding. In other words, the formula itself contains a part that sets $\beta_{k}$ to $0$, and to understand LASSO regression without relying on pictures or intuition, one must work through the mathematical discussion below.

Formulas

Optimal Solution [6]

$$
L \left( \beta \right) = {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1}
$$

When $\lambda$ is given as a constant, write the objective function of LASSO regression as $L$ above. In general, LASSO regression does not have a closed form for the optimal solution, but if all columns of $X$ are assumed to be mutually orthogonal, the $k$-th component $\left( \hat{\beta} \right)_{k}$ of $\hat{\beta} = \argmin_{\beta} L \left( \beta \right)$ is as follows.

$$
\begin{align*} \left( \hat{\beta} \right)_{k} =& {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \eta_{\lambda} \left( X^{T} Y \right)_{k} \\ = & {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \begin{cases} \left( X^{T} Y \right)_{k} + \lambda & , \text{if } \left( X^{T} Y \right)_{k} < - \lambda \\ 0 & , \text{if } \left( X^{T} Y \right)_{k} \in [-\lambda, \lambda] \\ \left( X^{T} Y \right)_{k} - \lambda & , \text{if } \left( X^{T} Y \right)_{k} > \lambda \end{cases} \end{align*}
$$

Here, $A^{T}$ is the transpose matrix of $A$, $A^{-1}$ is the inverse matrix of $A$, and $\eta_{\lambda}$ is soft thresholding.
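
A minimal sketch of this closed form, assuming the columns of $X$ are orthogonal (here enforced with a QR decomposition); the helper names `soft_threshold` and `lasso_orthogonal` are my own, not from any library.

```python
import numpy as np

def soft_threshold(z, lam):
    """eta_lambda(z): shrink toward zero by lam, exactly zero inside [-lam, lam]."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_orthogonal(X, Y, lam):
    """Closed-form LASSO solution, valid only when the columns of X are orthogonal."""
    XtX_diag = np.sum(X * X, axis=0)   # diagonal of X^T X
    XtY = X.T @ Y
    return soft_threshold(XtY, lam) / XtX_diag

# Build a design matrix with orthogonal columns via QR decomposition.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(50, 5)))   # 50 x 5, orthonormal columns
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.5])
Y = Q @ beta_true + 0.05 * rng.normal(size=50)

print(lasso_orthogonal(Q, Y, lam=0.3))   # small coefficients are driven exactly to 0
```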

Derivation [7]

Gradient of Vectors and Matrices:
$$
\frac{ \partial }{ \partial \mathbf{w} }\left( \mathbf{w}^{T}\mathbf{R}\mathbf{w} \right)= \left( \mathbf{R} + \mathbf{R}^{T} \right) \mathbf{w}
$$

Gradient of the Residual Sum of Squares:
$$
f \left( \mathbf{s} \right) := \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right)
$$
For a vector $\mathbf{y} \in \mathbb{R}^{n}$ and matrices $X \in \mathbb{R}^{n \times p}$, $R \in \mathbb{R}^{n \times n}$ that do not depend on $\mathbf{s}$, the following holds.
$$
{{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - X^{T} \left( R + R^{T} \right) \left( \mathbf{y} - X \mathbf{s} \right)
$$
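
As a sanity check on this lemma, the following sketch compares the analytic gradient with central finite differences on randomly generated $\mathbf{y}$, $X$, $R$; everything here is an illustrative assumption.

```python
import numpy as np

def f(s, y, X, R):
    r = y - X @ s
    return r @ R @ r                       # (y - Xs)^T R (y - Xs)

def grad_f(s, y, X, R):
    # Analytic gradient from the formula above: -X^T (R + R^T)(y - X s)
    return -X.T @ (R + R.T) @ (y - X @ s)

# Finite-difference check on random data.
rng = np.random.default_rng(2)
n, p = 6, 3
y, X, R = rng.normal(size=n), rng.normal(size=(n, p)), rng.normal(size=(n, n))
s = rng.normal(size=p)

eps = 1e-6
numeric = np.array([(f(s + eps * e, y, X, R) - f(s - eps * e, y, X, R)) / (2 * eps)
                    for e in np.eye(p)])
print(np.allclose(numeric, grad_f(s, y, X, R), atol=1e-5))   # expected: True
```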

Applying the above formulas with $R = I$ gives

$$
\begin{align*} {{ \partial } \over { \partial \beta }} L \left( \beta \right) =& {{ \partial } \over { \partial \beta }} {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + {{ \partial } \over { \partial \beta }} \lambda \left\| \beta \right\|_{1} \\ =& {{ \partial } \over { \partial \beta }} {{ 1 } \over { 2 }} \left( Y - X \beta \right)^{T} \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ =& - {{ 1 } \over { 2 }} X^{T} \left( I + I^{T} \right) \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ =& - X^{T} \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ =& - X^{T} Y + X^{T} X \beta + \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \end{align*}
$$

Since $\beta = \hat{\beta}$ must satisfy ${{ \partial } \over { \partial \beta }} L = 0$, we obtain the following.

$$
X^{T} X \hat{\beta} = X^{T} Y - \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1}
$$

Now write $\hat{\beta}_{k} := \left( \hat{\beta} \right)_{k}$ for $k = 0, 1, \cdots, p$. Denoting the $i$-th row of a matrix $A$ by $\left( A \right)_{i \cdot}$, we get the following equation.

$$
\begin{align*} & X^{T} X \hat{\beta} = X^{T} Y - \lambda {{ \partial } \over { \partial \beta }} \left\| \beta \right\|_{1} \\ \implies & \left( X^{T} X \right)_{i \cdot} \hat{\beta} = \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \sum_{j=0}^{p} \left| \hat{\beta}_{j} \right| \\ \implies & \sum_{j=0}^{p} \left( X^{T} X \right)_{i j} \hat{\beta}_{j} = \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \left| \hat{\beta}_{i} \right| \\ \implies & \hat{\beta}_{i} = {{ 1 } \over { \left( X^{T} X \right)_{ii} }} \left[ \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \left| \hat{\beta}_{i} \right| - \sum_{j \ne i} \left( X^{T} X \right)_{i j} \hat{\beta}_{j} \right] \end{align*}
$$

Since $\hat{\beta}_{i}$ depends on the other $\hat{\beta}_{j}$s, this is not a closed form, but the terms in $\hat{\beta}_{j}$ can be ignored whenever $\left( X^{T} X \right)_{i j} = 0$. Adding the assumption that the columns of $X$ are orthogonal makes $X^{T} X$ a diagonal matrix, so all of those cross terms vanish.

$\hat{\beta}_{k}$ can be greater than, less than, or equal to $0$. If $\hat{\beta}_{k} > 0$, then

$$
\begin{align*} & \hat{\beta}_{k} > 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} - \lambda {{ \partial } \over { \partial \hat{\beta}_{k} }} \left| \hat{\beta}_{k} \right| \right] > 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} - \lambda \right] > 0 \end{align*}
$$

which implies not only that $\hat{\beta}_{k} > 0$ but also that $\left( X^{T} Y \right)_{k} > \lambda$. Similarly, if $\hat{\beta}_{k} < 0$, then

$$
\begin{align*} & \hat{\beta}_{k} < 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} - \lambda (-1) \right] < 0 \\ \implies & \hat{\beta}_{k} = {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \left[ \left( X^{T} Y \right)_{k} + \lambda \right] < 0 \end{align*}
$$

which implies $\left( X^{T} Y \right)_{k} < - \lambda$. Otherwise $\hat{\beta}_{k} = 0$, and the three cases can be summarized with soft thresholding as follows.

$$
\begin{align*} \hat{\beta}_{k} =& {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \begin{cases} \left( X^{T} Y \right)_{k} + \lambda & , \text{if } \left( X^{T} Y \right)_{k} < - \lambda \\ 0 & , \text{if } \left( X^{T} Y \right)_{k} \in [-\lambda, \lambda] \\ \left( X^{T} Y \right)_{k} - \lambda & , \text{if } \left( X^{T} Y \right)_{k} > \lambda \end{cases} \\ = & {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \eta_{\lambda} \left( X^{T} Y \right)_{k} \end{align*}
$$

Algorithm

$$
\hat{\beta}_{i} = {{ 1 } \over { \left( X^{T} X \right)_{ii} }} \left[ \left( X^{T} Y \right)_{i} - \lambda {{ \partial } \over { \partial \hat{\beta}_{i} }} \left| \hat{\beta}_{i} \right| - \sum_{j \ne i} \left( X^{T} X \right)_{i j} \hat{\beta}_{j} \right]
$$

As seen in the derivation, having only implicit equations does not mean $\hat{\beta}_{j}$ cannot be found: an approximate solution can be obtained with an iterative algorithm such as gradient descent.[8] Unlike many optimization problems, with BPDN we at least have equations against which a candidate solution can be checked, which makes the situation somewhat favorable. On the other hand, even if an iterative method is ultimately used, when the columns of $X$ are reasonably close to orthogonal, the closed-form solution above can serve as a starting point. Issues caused by dependence among the variables are hard to blame on LASSO itself.
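
One way to turn the implicit equation above into an algorithm is to apply the soft-thresholding case analysis one coordinate at a time while the other coordinates are held fixed, i.e. coordinate-wise updates. The sketch below does exactly that; it is a rough illustration under those assumptions, not necessarily the method referenced in the footnote.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, Y, lam, n_iter=200):
    """Iteratively apply the per-coordinate update from the equation above,
    resolving the subgradient term with soft thresholding."""
    XtX = X.T @ X
    XtY = X.T @ Y
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for i in range(beta.size):
            # Everything except the i-th coordinate is held fixed.
            rho = XtY[i] - XtX[i] @ beta + XtX[i, i] * beta[i]
            beta[i] = soft_threshold(rho, lam) / XtX[i, i]
    return beta

# Small demonstration on correlated (non-orthogonal) columns.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
X[:, 1] += 0.5 * X[:, 0]                     # deliberately break orthogonality
Y = X @ np.array([2, 0, 0, -1, 0, 0, 0, 0.5]) + 0.1 * rng.normal(size=100)

print(lasso_coordinate_descent(X, Y, lam=5.0))
```

On strongly correlated columns the updates may need more iterations or an explicit convergence check on the change in $\hat{\beta}$.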

See Also


  1. James. (2013). An Introduction to Statistical Learning with Applications in R: p228.

  2. https://en.wikipedia.org/wiki/Ridge_regression

  3. https://en.wikipedia.org/wiki/Lasso_(statistics)

  4. James. (2013). An Introduction to Statistical Learning with Applications in R: p222.

  5. Muneyuki Kaneshiro, Yusuke Nomura. (2018). Blue Lock: Volume 1

  6. https://stats.stackexchange.com/q/174003/172321

  7. https://stats.stackexchange.com/a/246169/172321

  8. https://machinelearningcompass.com/machine_learning_models/lasso_regression/