Proximal Alternating Linearized Minimization Algorithm (PALM)
Overview
Jérôme Bolte, Shoham Sabach, and Marc Teboulle introduced an optimization technique called Proximal Alternating Linearized Minimization (PALM) algorithm in their paper Proximal alternating linearized minimization for nonconvex and nonsmooth problems.
Algorithm
The method to solve optimization problems like $(1)$ is called the Proximal Alternating Linearized Minimization (PALM) algorithm.
This, in simple terms, means performing alternating optimization for two variables using the proximal gradient method.
Description1 2
According to the abstract of the paper, the PALM algorithm is introduced as a method to solve optimization problems that are nonconvex and nonsmooth. Consider the following optimization problem.
$$ \operatornamewithlimits{minimize}_{x, y} \Psi (x, y) = f(x) + g(y) + H(x, y) \tag{1} $$
Here $x \in \mathbb{R}^{n}, y \in \mathbb{R}^{m}$ and $H : \mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathbb{R}$ is the $C^{1}$ function. Also, let’s say $f, g, H$ is not convex.
Linearized
The Linearized in the algorithm’s name means that we intend to approximate $H$ linearly. Applying Taylor’s theorem separately to the two variables $x$ and $y$, we can approximate $H$ as follows.
$$ \begin{align*} H(x, y) &\approx H(x^{(k)}, y) + (x - x^{(k)})^{T} \nabla_{x}H(x^{(k)},y) \\ &= H(x^{(k)} ,y) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \tag{2} \\ % &\approx \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \\ \end{align*} $$
$$ \begin{align*} H(x, y) &\approx H(x, y^{(k)}) + (y - y^{(k)})^{T} \nabla_{y}H(x,y^{(k)}) \\ &= H(x, y^{(k)}) + \langle y - y^{(k)}, \nabla_{y}H(x,y^{(k)}) \rangle \tag{3} \\ % &\approx \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \\ \end{align*} $$
Now, let’s say $\Psi$ approximated $H$ as $(2)$, and $\widehat{\Psi}$ and $(3)$ approximated it as $\Phi$.
$$ \begin{align*} \widehat{\Psi}(x,y) &= f(x) + g(y) + H(x^{(k)} ,y) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \\ \Phi (x,y) &= f(x) + g(y) + H(x, y^{(k)}) + \langle y - y^{(k)}, \nabla_{y}H(x,y^{(k)}) \rangle \end{align*} $$
As seen below, the result would be the same even if we put it this way. But let’s proceed assuming it’s as above.
$$ \begin{align*} \widehat{\Psi}(x,y) &= f(x) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \\ \Phi (x,y) &= g(y) + \langle y - y^{(k)}, \nabla_{y}H(x,y^{(k)}) \rangle \end{align*} $$
Alternating
Now, applying linear approximation and then alternating optimization of the two variables in $\Psi$ using the alternating algorithm, problem $(1)$ divides into the following two sub-problems.
$$ \begin{cases} x^{(k+1)} &= \argmin\limits_{x} \widehat{\Psi}(x, y^{(k)}) \\ &= \argmin\limits_{x} f(x) + g(y^{(k)}) + H(x^{(k)} ,y^{(k)}) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle \\ \\ y^{(k+1)} &= \argmin\limits_{y} \Phi (x^{(k+1)}, y) \\ &= \argmin\limits_{y} f(x^{(k+1)}) + g(y) + H(x^{(k+1)}, y^{(k)}) + \langle y - y^{(k)}, \nabla_{y}H(x^{(k+1)},y^{(k)}) \rangle \\ \end{cases} $$
In the first equation, the second and third terms do not affect the optimization, so they can be removed. Similarly, in the second equation, removing the first and third terms leaves the problem unchanged. Thus, the algorithm can be rewritten as follows.
$$ \begin{cases} x^{(k+1)} = \argmin\limits_{x} f(x) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle \\ y^{(k+1)} = \argmin\limits_{y} g(y) + \langle y - y^{(k)}, \nabla_{y}H(x^{(k+1)},y^{(k)}) \rangle \\ \end{cases} \tag{4} $$
Proximal
Now, if we add a term for proximal optimization to $(4)$, it becomes as follows.
$$ \begin{cases} x^{(k+1)} = \argmin\limits_{x} \left\{ f(x) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle + \dfrac{c}{2}\left\| x - x^{(k)} \right\|_{2}^{2} \right\} \tag{5} \\ y^{(k+1)} = \argmin\limits_{y} \left\{ g(y) + \langle y - y^{(k)}, \nabla_{y}H(x^{(k+1)},y^{(k)}) \rangle + \dfrac{d}{2}\left\| y - y^{(k)} \right\|_{2}^{2} \right\} \\ \end{cases} $$
Explanation 1
Let’s combine the second and third terms in the first equation as follows.
$$ \psi (x) = \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle + \dfrac{c}{2}\left\| x - x^{(k)} \right\|_{2}^{2} $$
If we find the point where $\psi$ is minimized,
$$ \begin{align*} \nabla_{x}\psi (x) = 0 &\implies \nabla_{x}H(x^{(k)}, y^{(k)}) + c(x - x^{(k)}) =0 \\ &\implies x = x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \\ \end{align*} $$
The point that is not too far from this point and also minimizes $f$ can be represented by a proximal operator. Therefore, $x^{(k+1)}$ is updated as follows.
$$ x^{(k+1)} \in \operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) $$
This is the same as the proximal gradient method.
Explanation 2
The optimization problem $(5)$ can be shown to be the same as $\operatornamewithlimits{prox}_{cf} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right)$ as follows.
$$ \begin{align*} & \operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) \\ &= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| x - \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) \right\|_{2}^{2} \right\} \\ &= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| (x - x^{(k)}) + \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right\|_{2}^{2} \right\} \\ &= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| (x - x^{(k)}) \right\|_{2}^{2} + (x - x^{(k)})^{T}\nabla_{x}H(x^{(k)}, y^{(k)}) + \dfrac{1}{c^{2}}\left\| \nabla_{x}H(x^{(k)}, y^{(k)}) \right\|_{2}^{2} \right\} \\ &= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| (x - x^{(k)}) \right\|_{2}^{2} + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)}, y^{(k)}) \rangle \right\} \\ \end{align*} $$
Following the explanations above, we finally obtain the following iterative algorithm.
$$ \begin{cases} x^{(k+1)} \in \operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) \\ y^{(k+1)} \in \operatornamewithlimits{prox}_{\frac{1}{d}g} \left( y^{(k)} - \dfrac{1}{d}\nabla_{y}H(x^{(k+1)}, y^{(k)}) \right) \\ \end{cases} $$
■
Bolte, Jérôme, Shoham Sabach, and Marc Teboulle. “Proximal alternating linearized minimization for nonconvex and nonsmooth problems.” Mathematical Programming 146.1-2 (2014): 459-494. ↩︎