
What is Ridge Regression?

Definition

$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$ Given $n$ data points and $p < n$ independent variables, the multiple linear regression model can be written compactly with a design matrix as $Y = X \beta + \varepsilon$. Solving the following optimization problem is then referred to as Ridge Regression. $$ \argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{2}^{2} \right) $$ Here, $\lambda \ge 0$ is called the Tuning Parameter.
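For concreteness, here is a minimal NumPy sketch of the model in the definition, simulating $n$ observations from $Y = X \beta + \varepsilon$ with a design matrix whose first column is all ones. The dimensions and coefficient values are arbitrary choices for illustration, not anything prescribed by the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                      # n observations, p predictors (p < n)

# Design matrix X: a leading column of ones for the intercept beta_0,
# followed by p columns of predictor values.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

beta = np.array([1.0, 2.0, 0.0, -3.0])   # (beta_0, beta_1, ..., beta_p)
eps = rng.normal(scale=0.5, size=n)      # error term

Y = X @ beta + eps                       # Y = X beta + eps
print(X.shape, Y.shape)                  # (100, 4) (100,)
```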


Description

Ridge regression is sometimes given a translated name in Korean, but it is more often read as the English word, pronounced [rɪdʒ], much as Edge is pronounced [edʒ]. The word ridge means a mountain ridge, and the translated term derives from the word for a ridge line, the line connecting the crests along a ridge. The name fits because, geometrically, the hypersurface formed by the objective function of $\beta$, the RSS of ordinary multiple regression analysis, develops ridges due to $\lambda \left\| \beta \right\|_{2}^{2}$.

Meanwhile, in machine learning, adding a term like $\lambda \left\| \beta \right\|_{2}^{2}$ is called regularization, and this particular form is also known as Tikhonov regularization1. Minimizing this term along with the RSS means not just performing regression analysis but also keeping the size of $\beta$, that is, the size of the regression coefficients, as small as possible, so that more regression coefficients end up close to $0$.

Regardless of the actual history or interpretation of the results, it’s hard to deny that ridge regression can be considered a type of sparse regression.

Why Use It?

It’s better to read the sparse regression document first:

  • From a statistical perspective: It’s for finding models that are simple to interpret. Fundamentally, a regression coefficient is considered statistically insignificant unless its size is sufficiently large relative to the scale of the data. In other words, even if a coefficient is not exactly $0$, if it is nearly $0$ it may not be needed to explain the data. This interpretation does not carry over to ridge regression exactly as stated, but the goal is ultimately the same: to find and deal with ‘small regression coefficients’.
  • From a machine learning perspective: It’s to prevent overfitting. One might be able to create a model that covers even very special cases by adding very complex terms or acquiring additional data to explain the given data well, but if it’s too meticulously tailored, it might perform excellently on training data but poorly in practical tests. Finding minutely detailed regression coefficients for countless variables increases the risk of overfitting, so the goal is to prevent it, even if it means taking on a penalty that reduces the data’s explanatory power.

This is just a difference in perspective, but it’s essentially the same thing when examined closely.

Tuning Parameter $\lambda$

The larger the tuning parameter $\lambda$ introduced in the definition, the stronger the penalty $\lambda \left\| \beta \right\|_{2}^{2}$ becomes. If $\lambda$ is too small, the problem is no different from ordinary regression analysis; if it is too large, $\beta = \mathbf{0}$ becomes the best choice regardless of how well the model explains the data. As an extreme and intuitive example, if the values in the data are on a scale of about 0~10 and the penalty is given an enormous weight $\lambda = 10^{6}$, all of the effort goes into minimizing $\left\| \beta \right\|$, completely neglecting the original task of building a model that explains the data well.
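The following sketch illustrates this shrinkage using scikit-learn’s Ridge estimator on simulated data; the data, the $\lambda$ values, and the choice of fit_intercept=False are illustrative assumptions, not part of the discussion above. The coefficient norm collapses toward $0$ as $\lambda$ grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Fit ridge regression with increasingly heavy penalties.
# fit_intercept=False so that every coefficient is penalized,
# matching the formulation used in this post.
for lam in [1e-3, 1.0, 1e2, 1e6]:
    beta_hat = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
    print(f"lambda = {lam:>9g}  ->  ||beta_hat||_2 = {np.linalg.norm(beta_hat):.4f}")
```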

The point is to choose an appropriate $\lambda$, and if there is no special intuition or criterion for the given data, simply varying $\lambda$ and selecting the value that minimizes the cross-validation error is one method. The figure above shows the error after varying $\lambda$ under cross-validation in one analysis2; the vertical dashed line marks the minimum, around $5 \times 10^{-1}$, and it is reasonable to use that value for $\lambda$ if there is no particular reason to do otherwise.
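As a hedged sketch of this kind of selection, scikit-learn’s RidgeCV searches a grid of candidate $\lambda$ values (called alphas there) by cross-validation; the simulated data and the grid below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=1.0, size=n)

# Candidate penalties on a log scale; RidgeCV keeps the one with the
# lowest cross-validation error (5-fold here).
alphas = np.logspace(-3, 3, 25)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected lambda:", model.alpha_)
```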

Difference from Lasso Regression

Historically, ridge regression was introduced in 1970 as a method that, in the bias-variance trade-off, accepts a small amount of bias in exchange for a more efficient estimate of the parameters.3 LASSO was first introduced in 1986 in geophysics and then reintroduced in 1996, when it was given the name LASSO.4

  • Objective function of ridge regression: $$\argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{2}^{2} \right)$$
  • Objective function of lasso regression: $$\argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} \right)$$

While the objective functions of ridge regression and lasso regression are very similar, leading to frequent comparisons, it’s important to note that they are quite different beyond this formal resemblance. The points usually raised when distinguishing ridge from lasso are whether the penalty term is $l_{2}$ or $l_{1}$, whether the objective is differentiable so that the optimal solution can be written in a simple closed form, and whether coefficients can actually be shrunk to exactly $0$. More detailed comparisons might add Python example code and empirical observations on which is generally superior, or when one might be better than the other.

… However, such explanations are abundantly available in books, Wikipedia, blogs, etc. Well-organized articles that neatly summarize these well-known points can be easily found by searching ‘Ridge vs LASSO’ on Google. In this post, we aim to delve just a little deeper than those comparisons.


  • The content below explains how ridge regression differs from lasso from the perspective of ridge regression. For the perspective of lasso on ridge regression, see this post.

Even when comparing ‘which performs better on given data’, it is true that lasso tends to behave closer to the original intention, i.e., actually setting some $\beta_{k}$ to $0$, more than ridge does. However, ending the comparison there strongly suggests the view that lasso regression developed out of ridge regression by overcoming some of its disadvantages to arrive at a simpler (sparser) solution. The problem with such a simplistic comparison is not that it is entirely wrong, but that it does not matter much in the grand scheme of things.

Ridge regression’s objective function is differentiable, and its optimal solution $\hat{\beta}$ is given in closed form $$ \hat{\beta} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} Y $$ without any additional conditions. This extreme simplicity is a strength of ridge regression in itself: it means ridge regression is flexible enough to be adopted immediately in almost any application built on least squares, and it gives ridge a much higher priority than lasso when researching or developing new methods. To be specific, it is only in straightforward settings such as $$ \argmin_{\beta} \left( \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{?} \right) $$ that weighing the strengths and weaknesses of ridge against lasso is even worthwhile. Once one moves on to more complex and difficult methods built on top of each, the scalability of ridge regression is so overwhelming that lasso regression is not even comparable. From the perspective of ‘research’ that must create new knowledge, rather than merely using ‘techniques’ for analysis, the two are fundamentally different.

Now, let’s take a closer look at how this beneficial $\hat{\beta}$ is derived.

Formula

Optimal Solution 5

$$ L \left( \beta \right) = \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{2}^{2} $$ When $\lambda$ is given as a constant, the objective function $L$ of ridge regression can be represented as above. The optimal solution $\hat{\beta} = \argmin_{\beta} L \left( \beta \right)$ of ridge regression is as follows: $$ \hat{\beta} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} Y $$ Here, $A^{T}$ is the transpose matrix of $A$, $I$ is the identity matrix, and $A^{-1}$ is the inverse matrix of $A$.
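A minimal NumPy sketch of this closed form follows; the function name ridge_beta_hat and the simulated data are illustrative assumptions, not part of the statement above.

```python
import numpy as np

def ridge_beta_hat(X, Y, lam):
    """Closed-form ridge solution (X^T X + lambda I)^{-1} X^T Y.

    Solving the linear system is preferred over forming the inverse explicitly.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Illustrative usage on simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)
print(ridge_beta_hat(X, Y, lam=1.0))
```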

Derivation

Gradient of Vectors and Matrices: $$ \frac{ \partial }{ \partial \mathbf{w} }\left( \mathbf{w}^{T}\mathbf{R}\mathbf{w} \right)= \left( \mathbf{R} + \mathbf{R}^{T} \right) \mathbf{w} $$

Gradient of Residual Sum of Squares: $$ f \left( \mathbf{s} \right) := \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right) $$ For vectors $\mathbf{y} \in \mathbb{R}^{n}$ and matrices $X \in \mathbb{R}^{n \times p}$, $R \in \mathbb{R}^{n \times n}$ independent of $\mathbf{s}$, the following holds: $$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - X^{T} \left( R + R^{T} \right) \left( \mathbf{y} - X \mathbf{s} \right) $$
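As a quick sanity check (not part of the original derivation), the following sketch compares this gradient formula against a finite-difference approximation on random data; the dimensions and step size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
y = rng.normal(size=n)
X = rng.normal(size=(n, p))
R = rng.normal(size=(n, n))          # not necessarily symmetric
s = rng.normal(size=p)

f = lambda s: (y - X @ s) @ R @ (y - X @ s)

# Analytic gradient from the formula above.
grad = -X.T @ (R + R.T) @ (y - X @ s)

# Central finite-difference approximation of the same gradient.
h = 1e-6
fd = np.array([(f(s + h * e) - f(s - h * e)) / (2 * h) for e in np.eye(p)])

print(np.allclose(grad, fd))         # expected: True
```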

Applying the above formulas to the case of $R = I$, $$ \begin{align*} {{ \partial } \over { \partial \beta }} L \left( \beta \right) =& {{ \partial } \over { \partial \beta }} \left\| Y - X \beta \right\|_{2}^{2} + {{ \partial } \over { \partial \beta }} \lambda \left\| \beta \right\|_{2}^{2} \\ =& {{ \partial } \over { \partial \beta }} \left( Y - X \beta \right)^{T} \left( Y - X \beta \right) + \lambda {{ \partial } \over { \partial \beta }} \beta^{T} \beta \\ =& - X^{T} \left( I + I^{T} \right) \left( Y - X \beta \right) + \lambda \left( I + I^{T} \right) \beta \\ =& - 2 X^{T} \left( Y - X \beta \right) + 2 \lambda \beta \\ =& - 2 X^{T} Y + 2 X^{T} X \beta + 2 \lambda \beta \end{align*} $$ and since $\beta = \hat{\beta}$ must satisfy ${{ \partial } \over { \partial \beta }} L = 0$, we obtain the following upon rearrangement: $$ \begin{align*} & X^{T} X \hat{\beta} + \lambda \hat{\beta} = X^{T} Y \\ \implies & \left( X^{T} X + \lambda I \right) \hat{\beta} = X^{T} Y \\ \implies & \hat{\beta} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} Y \end{align*} $$
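To make the derivation concrete, here is a small numerical check (an illustrative sketch, not part of the original post) that the gradient $-2 X^{T} Y + 2 X^{T} X \beta + 2 \lambda \beta$ indeed vanishes at $\hat{\beta}$, and that $\lambda = 0$ recovers the ordinary least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
Y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

# Closed-form ridge estimate.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# The gradient of L(beta) from the derivation, evaluated at beta_hat.
grad = -2 * X.T @ Y + 2 * X.T @ X @ beta_hat + 2 * lam * beta_hat
print(np.allclose(grad, 0))                     # expected: True

# With lambda = 0 the formula reduces to ordinary least squares.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_ols, beta_lstsq))        # expected: True
```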

See Also