ISTA: Iterative Soft Thresholding Algorithm

Algorithm

$$ \argmin_{\beta} \left( {{ 1 } \over { 2 }} \left\| Y - X \beta \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1} \right) $$

The following update rule for solving the BPDN, i.e. Lasso regression, problem above is called ISTA. $$ \beta^{(k+1)} = S_{\lambda} \left( \beta^{(k)} - \alpha X^{T} \left( X \beta^{(k)} - Y \right) \right) $$ Here $\alpha$ is the step size, and the vectorized soft-thresholding $S_{\lambda} : \mathbb{R}^{n} \to \mathbb{R}^{n}$ is the function that applies the soft-thresholding $\eta_{S} \left( x ; \lambda \right)$ componentwise to a vector.
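For a concrete illustration, the following is a minimal NumPy sketch of this update rule; the function names, the toy data, and the step size choice $\alpha = 1 / \left\| X \right\|_{2}^{2}$ are assumptions made for the example and are not part of the original text.

```python
import numpy as np

def soft_threshold(x, lam):
    # Componentwise soft-thresholding: shrink each entry toward zero by lam,
    # setting entries with |x_i| <= lam exactly to zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(X, Y, lam, alpha, n_iter=500):
    # Repeat beta <- S_lambda(beta - alpha * X^T (X beta - Y)), the update rule above.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - Y)   # gradient of the quadratic term
        beta = soft_threshold(beta - alpha * grad, lam)
    return beta

# Hypothetical toy problem: a sparse coefficient vector observed through noisy X @ beta.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
Y = X @ beta_true + 0.1 * rng.standard_normal(100)

alpha = 1.0 / np.linalg.norm(X, 2) ** 2   # assumed step size: reciprocal of the largest eigenvalue of X^T X
print(np.round(ista(X, Y, lam=0.5, alpha=alpha), 3))
```

Each iterate has exact zeros wherever the thresholded argument falls inside $[-\lambda, \lambda]$, which is how the method produces sparse estimates.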

Explanation

ISTA is a type of proximal gradient method. Unlike the least squares problem, whose solution can be computed conveniently via matrix algebra, ISTA is useful when one wants a sparse solution concentrated near the zero vector.

With $\lambda$ given as a constant, write the Lasso objective function above as $L$. In general, the optimal solution of Lasso does not have a closed form, but if we assume that all columns of $X$ are mutually orthogonal, then the $k$-th component $\left( \hat{\beta} \right)_{k}$ of the optimal solution is given by: $$ \begin{align*} \left( \hat{\beta} \right)_{k} =& {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \eta_{\lambda} \left( X^{T} Y \right)_{k} \\ = & {{ 1 } \over { \left( X^{T} X \right)_{kk} }} \begin{cases} \left( X^{T} Y \right)_{k} + \lambda & , \text{if } \left( X^{T} Y \right)_{k} < - \lambda \\ 0 & , \text{if } \left( X^{T} Y \right)_{k} \in [-\lambda, \lambda] \\ \left( X^{T} Y \right)_{k} - \lambda & , \text{if } \left( X^{T} Y \right)_{k} > \lambda \end{cases} \end{align*} $$
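For instance, in the simplest orthogonal case $X = I$, we have $\left( X^{T} X \right)_{kk} = 1$ and $X^{T} Y = Y$, so the formula reduces to soft-thresholding the data itself: $$ \left( \hat{\beta} \right)_{k} = \eta_{\lambda} \left( Y \right)_{k} $$ Components of $Y$ no larger than $\lambda$ in magnitude are set to zero, and the rest are shrunk toward zero by $\lambda$.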

In fact, the Lasso solution for BPDN can be expressed in this closed form when the design matrix is full-rank, and ISTA is a method that performs Lasso regression even when it is not. Understanding this theoretical background at the level of the formulas is important, since it cannot be avoided when moving on to the improved algorithm FISTA.

Derivation

$S_{\lambda}$ itself is a function that pulls the solution toward the zero vector and can be viewed as playing the role of a proximal operator, so in BPDN it suffices to deal with the quadratic term $L = \left\| Y - X \beta \right\|_{2}^{2} / 2$.

Gradient of the residual sum of squares: $$ f \left( \mathbf{s} \right) := \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right) $$ For a vector $\mathbf{y} \in \mathbb{R}^{n}$ and matrices $X \in \mathbb{R}^{n \times p}$, $R \in \mathbb{R}^{n \times n}$ that are independent of $\mathbf{s}$, the following holds. $$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - X^{T} \left( R + R^{T} \right) \left( \mathbf{y} - X \mathbf{s} \right) $$
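Since the rest of the derivation leans on this identity, a quick numerical spot check can be reassuring; the dimensions and random data below are arbitrary choices made only for this sketch.

```python
import numpy as np

# Finite-difference check of d f / d s = -X^T (R + R^T)(y - X s)
# for f(s) = (y - X s)^T R (y - X s), on randomly generated data.
rng = np.random.default_rng(1)
n, p = 6, 4
y = rng.standard_normal(n)
X = rng.standard_normal((n, p))
R = rng.standard_normal((n, n))
s = rng.standard_normal(p)

def f(s):
    r = y - X @ s
    return r @ R @ r

analytic = -X.T @ (R + R.T) @ (y - X @ s)

eps = 1e-6
numeric = np.array([(f(s + eps * e) - f(s - eps * e)) / (2 * eps) for e in np.eye(p)])

print(np.max(np.abs(analytic - numeric)))  # prints a value near zero (floating-point precision)
```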

Applying the above identity with $R = I$ gives $$ \begin{align*} & {{ \partial } \over { \partial \beta }} L \\ =& {{ 1 } \over { 2 }} {{ \partial } \over { \partial \beta }} \left\| Y - X \beta \right\|_{2}^{2} \\ =& {{ \partial } \over { \partial \beta }} {{ 1 } \over { 2 }} \left( Y - X \beta \right)^{T} \left( Y - X \beta \right) \\ =& - {{ 1 } \over { 2 }} X^{T} \left( I + I^{T} \right) \left( Y - X \beta \right) \\ =& X^{T} \left( X \beta - Y \right) \end{align*} $$ This is precisely the gradient $\nabla L$, so the corresponding gradient descent step is $$ \beta^{(k+1)} = \beta^{(k)} - \alpha \nabla L $$ and applying the vectorized soft-thresholding to this step yields the ISTA update rule.
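To connect the derivation back to the closed form from the Explanation section, the sketch below builds a design matrix with orthonormal columns via a QR factorization (an assumed setup for this example only); since $X^{T} X = I$ there, taking $\alpha = 1$ and starting from $\beta^{(0)} = 0$, a single update already lands on the closed-form solution ${{ 1 } \over { \left( X^{T} X \right)_{kk} }} \eta_{\lambda} \left( X^{T} Y \right)_{k}$.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(2)
# Q has orthonormal columns, so Q^T Q = I and (Q^T Q)_kk = 1.
Q, _ = np.linalg.qr(rng.standard_normal((50, 8)))
Y = rng.standard_normal(50)
lam = 0.2

# One ISTA update from beta = 0 with step size alpha = 1.
beta = np.zeros(8)
beta = soft_threshold(beta - 1.0 * Q.T @ (Q @ beta - Y), lam)

# Closed form for orthogonal columns from the Explanation section.
closed_form = soft_threshold(Q.T @ Y, lam)
print(np.allclose(beta, closed_form))  # True
```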