What is Sparse Regression?
Definition
Let’s consider a matrix equation given by a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $\mathbf{b} \in \mathbb{R}^{m}$ as follows: $$ A \mathbf{x} = \mathbf{b} $$ Sparse Regression refers to methods that aim to find a solution, or a least squares solution, to such matrix equations while maximizing the sparsity of $\mathbf{x}$, in other words, keeping the number of non-zero elements of $\mathbf{x}$ as small as possible. When the solution $\mathbf{x}$ has exactly $K$ non-zero components, it is said to be $K$-sparse in $A$ [1], and Sparse Regression is the process of minimizing this $K$ in a regression problem.
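To make $K$-sparsity concrete, here is a minimal sketch in Python (NumPy assumed); the vector is made up purely for illustration, and $K$ is simply the count of non-zero components, i.e. the $\ell_{0}$ “norm” of $\mathbf{x}$.

```python
import numpy as np

# A candidate solution with 5 components, only 2 of which are non-zero.
x = np.array([0.0, 3.1, 0.0, -0.7, 0.0])

# K is the number of non-zero components of x.
K = np.count_nonzero(x)
print(K)  # 2, so x is 2-sparse
```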
Details
While classical statistical regression analysis originated from using matrix algebra to find regression lines, ‘regression problems’ themselves have since expanded to finding a function that explains $\mathbf{y} \in \mathbb{R}$ from given data $\mathbf{X} \in X$, $$ \mathbf{y} = f \left( \mathbf{X} \right) $$ In the same way, ‘sparse regression’ should be thought of not as one specific method but as an approach to finding solutions with as few non-zero elements as possible.
Occam’s Razor [2]
Occam’s Razor: “Among all possible descriptions, the simplest correct model is probably the true one.”
Occam’s Razor, often encountered in scientific philosophy, is also known as the Principle of Parsimony. Simply put, it states that among various hypotheses that adequately describe a phenomenon, the simplest one is likely the best. It suggests cutting off unnecessary explanations with a razor. [3]
In this context, minimizing $K$ in a $K$-sparse solution follows Occam’s principle of parsimony, and the principle is worth valuing not just for its engineering benefits but as a universal value in its own right. Let’s now examine specific cases of minimizing $K$.
Statistics: Best Subset Selection
This can be summarized as model simplification.
In statistics, the ‘model’ ultimately derived is evaluated not just by performance but also by how well it aligns with intuitive logic, mainstream academic opinion, expert knowledge, scientific facts, and social norms. In the 21st century, when big data is everywhere, models that merely fit the data well lose value, especially in settings such as non-linear regression analysis where any number of derived variables can be ‘created’. Predicting well with hundreds of thousands of variables is useful, but condensing the result into a form humans can understand is a separate challenge.
- Best Subset Selection: In classical statistics, exploring reduced models means incrementally adding or removing independent variables according to variable selection criteria that match the objective. This iterative algorithmic process essentially amounts to fixing the regression coefficients of some variables at zero or freeing them. [4] Refitting the regression for every subset of variables is an NP-hard problem, so in practice it is approached naively with greedy algorithms, but it is hard to say that this is not a form of sparse regression; a brute-force sketch follows below.
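As an illustration of what the exhaustive search would look like, here is a minimal sketch assuming NumPy and synthetic data: it fits ordinary least squares on every subset of columns and keeps the subset with the best size-penalized fit. Real implementations replace this brute force with greedy stepwise or penalized methods.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])  # only two relevant variables
y = X @ beta_true + 0.1 * rng.normal(size=n)

best_subset, best_score = None, np.inf
for k in range(1, p + 1):                           # exhaustive search: 2^p - 1 subsets
    for subset in itertools.combinations(range(p), k):
        coef, _, _, _ = np.linalg.lstsq(X[:, list(subset)], y, rcond=None)
        rss = np.sum((y - X[:, list(subset)] @ coef) ** 2)
        score = n * np.log(rss / n) + 2 * k         # AIC-style penalty on model size
        if score < best_score:
            best_subset, best_score = subset, score

print(best_subset)  # expected: (0, 3), the truly non-zero coefficients
```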
Machine Learning: Lasso Regression
This can be summarized as dimension reduction.
In machine learning, dimension reduction is not necessarily about simplifying data for human understanding but about reducing variables while losing as little meaningful information as possible. When our dependent variable $y$ is a linear combination of $x_{0}, x_{1} , \cdots , x_{p}$, a zero coefficient for a variable indicates it does not explain the dependent variable. Finding a solution with the most zero coefficients, while still explaining the data well, is akin to dimension reduction.
- Lasso Regression: Unlike ordinary regression analysis, lasso regression still aims to solve a problem of the form $$ Y = X \beta $$ but also wishes to keep $\left\| \beta \right\|_{1}$ small at the same time, typically by minimizing $$ \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} $$ over $\beta$ for some penalty weight $\lambda > 0$. Adding this constraint may not explain the data as well as the solution $\beta$ obtained from ordinary regression analysis, but it aims to identify ‘what really matters’ at the cost of some performance.
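As a concrete illustration, here is a minimal sketch using scikit-learn’s `Lasso` on synthetic data (the penalty weight `alpha` is an arbitrary choice for the example). The point is that many fitted coefficients come out exactly zero, unlike ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[1, 4]] = [3.0, -2.0]          # only two variables actually matter
y = X @ beta_true + 0.1 * rng.normal(size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)       # alpha controls the strength of the l1 penalty

print(np.count_nonzero(np.abs(ols.coef_) > 1e-10))  # typically all 10 are non-zero
print(np.count_nonzero(lasso.coef_))                # typically only the 2 relevant ones
```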
Signal Processing: Compressed Sensing
This can be summarized as noise canceling and compression.
Taking the Fast Fourier Transform of a given signal moves it from the time domain to the frequency domain, representing the signal as a linear combination of sines and cosines. Components considered unimportant or noise in the data will have very small coefficients. For instance, if a signal $y(t)$ is represented as $$ y(t) = 70 \cos ( t ) + 0.03 \cos ( 2 t ) + 0.0002 \cos ( 3 t ) + 24 \cos ( 4 t ) + \cdots $$ then $\cos ( 2 t )$ and $\cos ( 3 t )$ could be deemed negligible and removed. This is sometimes described as eliminating components that are not dominant modes or whose frequencies carry little energy. Conversely, if a signal can be reconstructed from only a few basis functions, this can be seen as compression, since most of the insignificant data is discarded.
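The sketch below, assuming NumPy and a synthetic signal in the spirit of the $y(t)$ above, takes an FFT, zeroes out the coefficients with negligible magnitude, and reconstructs the signal from the few that remain.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1024, endpoint=False)
y = 70 * np.cos(t) + 0.03 * np.cos(2 * t) + 0.0002 * np.cos(3 * t) + 24 * np.cos(4 * t)

Y = np.fft.rfft(y)                            # frequency-domain representation
keep = np.abs(Y) > 0.01 * np.abs(Y).max()     # keep only high-energy coefficients
print(np.count_nonzero(keep))                 # only a handful of frequencies survive

y_compressed = np.fft.irfft(Y * keep, n=len(t))   # reconstruct from the sparse spectrum
print(np.max(np.abs(y - y_compressed)))       # small reconstruction error
```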
- Compressed Sensing: Compressed sensing differs from the sparse regression described so far in its problem setting: it assumes from the outset that $A \in \mathbb{R}^{m \times n}$ is underdetermined ($m < n$), so that there are infinitely many solutions, and the goal is to find the sparsest one. In the context of signal processing, a component of $\mathbf{x}$ being zero means the corresponding data is unnecessary, hence compression, and a non-zero component means the necessary data has been sensed, which makes compressed sensing an apt name.
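As a small illustrative sketch on synthetic data, using scikit-learn’s `OrthogonalMatchingPursuit` as one of several possible sparse solvers (not the only way compressed sensing is solved in practice), the code below recovers a sparse $\mathbf{x}$ from an underdetermined system $A \mathbf{x} = \mathbf{b}$ with far fewer measurements than unknowns.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
m, n, K = 40, 200, 5                     # 40 measurements, 200 unknowns, 5-sparse signal
A = rng.normal(size=(m, n)) / np.sqrt(m)

x_true = np.zeros(n)
support = rng.choice(n, size=K, replace=False)
x_true[support] = rng.normal(size=K)
b = A @ x_true                           # underdetermined: infinitely many exact solutions

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False).fit(A, b)
x_hat = omp.coef_

print(sorted(np.flatnonzero(x_hat)))     # recovered support
print(sorted(support))                   # true support; typically they coincide
print(np.max(np.abs(x_hat - x_true)))    # typically near zero
```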
1. Brunton. (2022). Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control (2nd Edition): p97.
2. Brunton. (2022). Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control (2nd Edition): p114.
3. Occam was a 14th-century figure, and razors back then were not as refined and safe as today, leading to dangerous shaving accidents.
4. https://cims.nyu.edu/~cfgranda/pages/OBDA_spring16/material/sparse_regression.pdf