
Proximal Gradient Method

Definition 1

Let’s say the non-differentiable objective function $H(\mathbf{x}) : \mathbb{R}^{n} \to \mathbb{R}$ is decomposed into a differentiable function $f$ and a non-differentiable function $g$.

$$ H(\mathbf{x}) = f(\mathbf{x}) + g(\mathbf{x}) $$
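A typical example is the LASSO objective $H(\mathbf{x}) = \dfrac{1}{2} \left\| A\mathbf{x} - \mathbf{b} \right\|_{2}^{2} + \mu \left\| \mathbf{x} \right\|_{1}$, where the differentiable squared-error term plays the role of $f$ and the non-differentiable $\ell_{1}$ penalty plays the role of $g$.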

The method of solving the optimization problem for $H$ with the following iterative algorithm is called the proximal gradient method.

$$ \mathbf{x}^{(k+1)} = \operatorname{prox}_{\lambda g} \left( \mathbf{x}^{(k)} - \lambda \nabla f (\mathbf{x}^{(k)}) \right) $$

Explanation

It is called the proximal gradient method because it uses both gradient descent and proximal optimization.
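In code, the iteration is simply these two steps composed. Below is a minimal sketch in Python, assuming the caller supplies a gradient oracle `grad_f`, a proximal operator `prox_g`, and a step size `lam`; all of these names are illustrative, not from any particular library.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, lam=0.1, num_iters=100):
    """Minimize H(x) = f(x) + g(x) by alternating a gradient step on f
    with a proximal step on g."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x_f = x - lam * grad_f(x)  # gradient descent step on the differentiable part f
        x = prox_g(x_f, lam)       # proximal step on the non-differentiable part g
    return x
```

A concrete instance for the LASSO example above appears after the derivation below.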

Derivation

※ The derivation that approximates $f(\mathbf{x})$ with Taylor’s theorem is not presented here, because the resulting algebraic development is not smooth and makes the algorithm’s formula harder to understand.

The function we want to optimize is $H$, and we think of splitting it into a differentiable part $f$ and a non-differentiable part $g$; in other words, we want to optimize $f$ and $g$ at the same time. Since $f$ is differentiable, the next step’s point can be obtained through gradient descent.

$$ \mathbf{x}_{f} = \mathbf{x} - \lambda \nabla f(\mathbf{x}) $$

A point that optimizes $g$ while not straying far from $\mathbf{x}_{f}$ can then be obtained through proximal optimization.

$$ \mathbf{x}_{\ast} = \operatorname{prox}_{\lambda g}(\mathbf{x}_{f}) = \argmin\limits_{\mathbf{v}} \left\{ g(\mathbf{v}) + \dfrac{1}{2\lambda}\left\| \mathbf{v} - \mathbf{x}_{f} \right\|_{2}^{2} : \mathbf{v} \in \mathbb{R}^{n} \right\} $$
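For instance, when $g(\mathbf{v}) = \left\| \mathbf{v} \right\|_{1}$, this minimization has a well-known componentwise closed form, the soft-thresholding operator:

$$ \left[ \operatorname{prox}_{\lambda \left\| \cdot \right\|_{1}} (\mathbf{x}_{f}) \right]_{i} = \operatorname{sign} \left( [\mathbf{x}_{f}]_{i} \right) \max \left( \left| [\mathbf{x}_{f}]_{i} \right| - \lambda, 0 \right) $$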

Therefore, we obtain the following update formula for the solution.

$$
\begin{align*}
\mathbf{x}^{(k+1)} &= \operatorname{prox}_{\lambda g} \left( \mathbf{x}^{(k)} - \lambda \nabla f(\mathbf{x}^{(k)}) \right) \\
&= \argmin\limits_{\mathbf{v}} \left\{ g(\mathbf{v}) + \dfrac{1}{2\lambda}\left\| \mathbf{v} - \left( \mathbf{x}^{(k)} - \lambda \nabla f(\mathbf{x}^{(k)}) \right) \right\|_{2}^{2} : \mathbf{v} \in \mathbb{R}^{n} \right\}
\end{align*}
$$
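Putting the two steps together for the LASSO example mentioned earlier gives an iterative soft-thresholding scheme. The following sketch is illustrative only: the data `A`, `b`, the regularization weight `mu`, and the step size `lam` are assumed inputs, and `lam` is taken small enough (here $1 / \left\| A \right\|_{2}^{2}$) for the gradient step to be stable.

```python
import numpy as np

def lasso_proximal_gradient(A, b, mu, lam, num_iters=500):
    """Minimize H(x) = 0.5*||Ax - b||_2^2 + mu*||x||_1 with the update
    x <- prox_{lam*g}(x - lam * grad f(x))."""
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad_f = A.T @ (A @ x - b)  # gradient of the differentiable part f
        x_f = x - lam * grad_f      # gradient descent step
        # proximal step: soft-thresholding, the prox of lam*mu*||.||_1
        x = np.sign(x_f) * np.maximum(np.abs(x_f) - lam * mu, 0.0)
    return x

# Usage with hypothetical random data.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)
x_hat = lasso_proximal_gradient(A, b, mu=0.1, lam=1.0 / np.linalg.norm(A, 2) ** 2)
```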


  1. Parikh, Neal, and Stephen Boyd. “Proximal Algorithms.” Foundations and Trends® in Optimization 1.3 (2014), Ch. 4.2 Proximal Gradient Method. ↩︎