logo

Proximal Operator 📂Optimization

Proximal Operator

Definition1

Let $\left\| \cdot \right\|_{X}$ be the norm of the vector space $X$. The proximal operator $\operatorname{prox}_{\lambda f} : X \to 2^{X}$ for a function $f : X \to \mathbb{R}$ is defined as follows, for $\lambda > 0$,

$$ \begin{align} \operatorname{prox}_{\lambda f} (\mathbf{x}) &:= \argmin\limits_{\mathbf{v}} \left\{ \lambda f(\mathbf{v}) + \dfrac{1}{2}\left\| \mathbf{v} - \mathbf{x} \right\|_{X}^{2} : \mathbf{v} \in X \right\} \\ &= \argmin\limits_{\mathbf{v}} \left\{ f(\mathbf{v}) + \dfrac{1}{2\lambda}\left\| \mathbf{v} - \mathbf{x} \right\|_{X}^{2} : \mathbf{v} \in X \right\} \nonumber \end{align} $$

$\argmin$ represents the optimal solution.

Simplified Definition

If the above definition is difficult, it can be understood as follows. The proximal operator $\operatorname{prox}_{\lambda f} : \mathbb{R}^{n} \to \mathbb{R}^{n}$ for a function $f : \mathbb{R}^{n} \to \mathbb{R}$ is defined as:

$$ \operatorname{prox}_{\lambda f} (\mathbf{x}) := \argmin\limits_{\mathbf{v}} \left\{ \lambda f(\mathbf{v}) + \dfrac{1}{2}\left\| \mathbf{v} - \mathbf{x} \right\|_{2}^{2} : \mathbf{v} \in \mathbb{R}^{n} \right\} $$

Explanation 2

According to definition $(1)$, $\operatorname{prox}_{\lambda f} (\mathbf{x})$ means compromising between minimizing the function value of $f$ and getting closer to $\mathbf{x}$. In other words, it can be said that it is a point that makes the value of $f$ the smallest in the vicinity of $\mathbf{x}$. It can be seen as adding a kind of regularization to the problem of optimizing $f$. Depending on the value of $\lambda$, it is determined which term is given more value in reduction. Point $\operatorname{prox}_{\lambda f} (\mathbf{x})$ is called the proximal point for $f$ with respect to $\mathbf{x}$.

Proximal means ‘close’ or ’nearby’, and $\operatorname{prox}_{\lambda f} (\mathbf{x})$ seems to be named proximal because it finds the optimal solution of $f$ among the points close to $\mathbf{x}$.

Functions $f$ that are easy to calculate with $\operatorname{prox}_{\lambda f} (\cdot)$ are called prox friendly.

In the case of $X = \mathbb{R}^{n}$, if $f$ is a convex function, then $\operatorname{prox}_{\lambda f} (\mathbf{x})$ is unique.

Algorithms that solve optimization problems using the proximal operator are called proximal algorithms. These include:

Properties3

  • Separable sum: If the function $f$ is separable as $f(\mathbf{u}, \mathbf{v}) = \phi (\mathbf{u}) + \psi (\mathbf{v})$, then the following holds: $$ \operatorname{prox}_{\lambda f} (\mathbf{x}, \mathbf{y}) = (\operatorname{prox}_{\lambda \phi} (\mathbf{x}), \operatorname{prox}_{\lambda \psi} (\mathbf{y})) $$ If it is completely separable for more than one variable as $f(x) = \sum_{i}^{n}f_{i}(x_{i})$, then, $$ (\operatorname{prox}_{\lambda f} (x))_{i} = \operatorname{prox}_{\lambda f_{i}} (x_{i}) $$
  • Postcomposition: For $f(\mathbf{x}) = \alpha \phi (\mathbf{x}) + b$ and $\alpha \gt 0$, the following holds: $$ \operatorname{prox}_{\lambda f} (\mathbf{x}) = \operatorname{prox}_{\alpha \lambda \phi} (\mathbf{x}) $$
  • Precomposition: For $f(\mathbf{x}) = \phi (\alpha \mathbf{x} + b)$ and $\alpha \ne 0$, the following holds: $$ \operatorname{prox}_{\lambda f} (\mathbf{x}) = \dfrac{1}{\alpha}\left( \operatorname{prox}_{\alpha^{2} \lambda \phi} (\alpha \mathbf{x} + b) - b \right) $$ If $f(\mathbf{x}) = \phi (Q \mathbf{x})$ and $Q$ is orthogonal, then the following holds: $$ \operatorname{prox}_{\lambda f} (\mathbf{x}) = Q^{T}\operatorname{prox}_{\lambda \phi} (Q\mathbf{x}) $$
  • Affine addition: For $f(\mathbf{x}) = \phi (\mathbf{x}) + \mathbf{a}^T\mathbf{x} + b$, the following holds: $$ \operatorname{prox}_{\lambda f} (\mathbf{x}) = \operatorname{prox}_{\lambda \phi} (\mathbf{x} - \lambda \mathbf{a}) $$
  • Regularization: For $f(\mathbf{x}) = \phi (\mathbf{x}) + \frac{\rho}{2} \left\| \mathbf{x} - \mathbf{a} \right\|_{2}^{2}$, the following holds: $$ \operatorname{prox}_{\lambda f} (\mathbf{x}) = \operatorname{prox}_{\tilde{\lambda} \phi} ((\tilde{\lambda}/\lambda)\mathbf{x} + (\rho \tilde{\lambda}) \mathbf{a}) $$ Here, $\tilde{\lambda} = \lambda / (1+\lambda \rho)$.

  1. Parikh, Neal, and Stephen Boyd. “Proximal algorithms.” Foundations and trends® in Optimization 1.3 (2014), p3. ↩︎

  2. https://web.eecs.umich.edu/~fessler/course/598/l/n-05-prox.pdf ↩︎

  3. https://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf Ch2 Properties ↩︎