Proximal Alternating Linearized Minimization Algorithm (PALM)
Overview

Jérôme Bolte, Shoham Sabach, and Marc Teboulle introduced the Proximal Alternating Linearized Minimization (PALM) algorithm in their paper "Proximal alternating linearized minimization for nonconvex and nonsmooth problems".
Algorithm

The method for solving optimization problems of the form $(1)$ below is called the Proximal Alternating Linearized Minimization (PALM) algorithm. In simple terms, this means performing alternating optimization over the two variables using the proximal gradient method.
Description

According to the abstract of the paper, the PALM algorithm is introduced as a method for solving nonconvex and nonsmooth optimization problems. Consider the following optimization problem.
$$
\operatornamewithlimits{minimize}_{x, y} \Psi (x, y) = f(x) + g(y) + H(x, y) \tag{1}
$$
Here $x \in \mathbb{R}^{n}$, $y \in \mathbb{R}^{m}$, and $H : \mathbb{R}^{n} \times \mathbb{R}^{m} \to \mathbb{R}$ is a $C^{1}$ function. Note that none of $f$, $g$, $H$ is assumed to be convex.
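As a concrete instance of $(1)$ (an illustration in the spirit of the matrix factorization applications discussed in the paper, not a formula taken from this article), one may take

$$
\operatornamewithlimits{minimize}_{x, y} \ \lambda_{1} \left\| x \right\|_{1} + \lambda_{2} \left\| y \right\|_{1} + \dfrac{1}{2} \left\| M - x y^{T} \right\|_{F}^{2}
$$

that is, $f(x) = \lambda_{1} \| x \|_{1}$, $g(y) = \lambda_{2} \| y \|_{1}$, and $H(x, y) = \frac{1}{2} \| M - x y^{T} \|_{F}^{2}$ for some data matrix $M$. Here $f$ and $g$ are nonsmooth, and $H$ is $C^{1}$ but not jointly convex, so the problem fits the setting above.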
Linearized

The "Linearized" in the algorithm's name means that we approximate $H$ linearly. Applying Taylor's theorem separately in the two variables $x$ and $y$, we can approximate $H$ as follows.
$$
\begin{align*}
H(x, y) &\approx H(x^{(k)}, y) + (x - x^{(k)})^{T} \nabla_{x}H(x^{(k)},y) \\
&= H(x^{(k)}, y) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \tag{2}
\end{align*}
$$
$$
\begin{align*}
H(x, y) &\approx H(x, y^{(k)}) + (y - y^{(k)})^{T} \nabla_{y}H(x,y^{(k)}) \\
&= H(x, y^{(k)}) + \langle y - y^{(k)}, \nabla_{y}H(x,y^{(k)}) \rangle \tag{3}
\end{align*}
$$
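For the factorization example above, the partial gradients appearing in $(2)$ and $(3)$ would be $\nabla_{x}H(x, y) = (x y^{T} - M)y$ and $\nabla_{y}H(x, y) = (x y^{T} - M)^{T}x$; these expressions are used again in the code sketch at the end of this section.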
Now let $\widehat{\Psi}$ denote $\Psi$ with $H$ replaced by the approximation $(2)$, and let $\Phi$ denote $\Psi$ with $H$ replaced by the approximation $(3)$.
$$
\begin{align*}
\widehat{\Psi}(x,y) &= f(x) + g(y) + H(x^{(k)}, y) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \\
\Phi (x,y) &= f(x) + g(y) + H(x, y^{(k)}) + \langle y - y^{(k)}, \nabla_{y}H(x,y^{(k)}) \rangle
\end{align*}
$$
As will be seen below, the result is the same even if we define them as follows, but let us proceed with the definitions above.
$$
\begin{align*}
\widehat{\Psi}(x,y) &= f(x) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y) \rangle \\
\Phi (x,y) &= g(y) + \langle y - y^{(k)}, \nabla_{y}H(x,y^{(k)}) \rangle
\end{align*}
$$
Alternating

Now, applying the linear approximation and optimizing the two variables of $\Psi$ alternately, as in the alternating algorithm, problem $(1)$ splits into the following two sub-problems.
$$
\begin{cases}
x^{(k+1)} &= \argmin\limits_{x} \widehat{\Psi}(x, y^{(k)}) \\
&= \argmin\limits_{x} f(x) + g(y^{(k)}) + H(x^{(k)}, y^{(k)}) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle \\
\\
y^{(k+1)} &= \argmin\limits_{y} \Phi (x^{(k+1)}, y) \\
&= \argmin\limits_{y} f(x^{(k+1)}) + g(y) + H(x^{(k+1)}, y^{(k)}) + \langle y - y^{(k)}, \nabla_{y}H(x^{(k+1)},y^{(k)}) \rangle
\end{cases}
$$
In the first equation, the second and third terms do not affect the optimization, so they can be removed. Similarly, in the second equation, removing the first and third terms leaves the problem unchanged. Thus, the algorithm can be rewritten as follows.
$$
\begin{cases}
x^{(k+1)} = \argmin\limits_{x} f(x) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle \\
y^{(k+1)} = \argmin\limits_{y} g(y) + \langle y - y^{(k)}, \nabla_{y}H(x^{(k+1)},y^{(k)}) \rangle
\end{cases} \tag{4}
$$
Proximal

Now, adding a proximal (quadratic penalty) term to each sub-problem in $(4)$ gives the following.
$$
\begin{cases}
x^{(k+1)} = \argmin\limits_{x} \left\{ f(x) + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle + \dfrac{c}{2}\left\| x - x^{(k)} \right\|_{2}^{2} \right\} \\
y^{(k+1)} = \argmin\limits_{y} \left\{ g(y) + \langle y - y^{(k)}, \nabla_{y}H(x^{(k+1)},y^{(k)}) \rangle + \dfrac{d}{2}\left\| y - y^{(k)} \right\|_{2}^{2} \right\}
\end{cases} \tag{5}
$$
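The constants $c$ and $d$ cannot be chosen arbitrarily. In the original paper they are picked per iteration from the Lipschitz moduli of the partial gradients (this restates the paper's step-size rule, not something derived in this article): with $\gamma_{1}, \gamma_{2} > 1$,

$$
c = \gamma_{1} L_{1}(y^{(k)}), \qquad d = \gamma_{2} L_{2}(x^{(k+1)})
$$

where $L_{1}(y)$ is a Lipschitz constant of $x \mapsto \nabla_{x}H(x, y)$ and $L_{2}(x)$ is a Lipschitz constant of $y \mapsto \nabla_{y}H(x, y)$.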
Explanation 1
Let’s combine the second and third terms in the first equation as follows.
$$
\psi (x) = \langle x - x^{(k)}, \nabla_{x}H(x^{(k)},y^{(k)}) \rangle + \dfrac{c}{2}\left\| x - x^{(k)} \right\|_{2}^{2}
$$
If we find the point where $\psi$ is minimized,
$$
\begin{align*}
\nabla_{x}\psi (x) = 0
&\implies \nabla_{x}H(x^{(k)}, y^{(k)}) + c(x - x^{(k)}) = 0 \\
&\implies x = x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)})
\end{align*}
$$
A point that stays close to this minimizer while also keeping $f$ small is exactly what the proximal operator produces. Therefore, $x^{(k+1)}$ is updated as follows.
$$
x^{(k+1)} \in \operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right)
$$
This is the same as the proximal gradient method.
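As a concrete illustration (a minimal sketch, not from the article), suppose $f(x) = \lambda \left\| x \right\|_{1}$. Then the proximal operator has the closed-form soft-thresholding solution, and the $x$-update above becomes a few lines of code. The helper names soft_threshold and grad_H_x are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def palm_x_update(x, y, grad_H_x, c, lam):
    """One PALM x-update for f(x) = lam * ||x||_1:
    x_next = prox_{(1/c) f}( x - (1/c) * grad_x H(x, y) ).
    """
    v = x - grad_H_x(x, y) / c           # gradient step on the smooth coupling term H
    return soft_threshold(v, lam / c)    # prox step on the nonsmooth f
```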
Explanation 2
The optimization problem $(5)$ can be shown to be the same as $\operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right)$ as follows.
$$
\begin{align*}
& \operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) \\
&= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| x - \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) \right\|_{2}^{2} \right\} \\
&= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| (x - x^{(k)}) + \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right\|_{2}^{2} \right\} \\
&= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| x - x^{(k)} \right\|_{2}^{2} + (x - x^{(k)})^{T}\nabla_{x}H(x^{(k)}, y^{(k)}) + \dfrac{1}{2c}\left\| \nabla_{x}H(x^{(k)}, y^{(k)}) \right\|_{2}^{2} \right\} \\
&= \argmin\limits_{x} \left\{ f(x) + \dfrac{c}{2} \left\| x - x^{(k)} \right\|_{2}^{2} + \langle x - x^{(k)}, \nabla_{x}H(x^{(k)}, y^{(k)}) \rangle \right\}
\end{align*}
$$

The last equality holds because the term $\frac{1}{2c}\left\| \nabla_{x}H(x^{(k)}, y^{(k)}) \right\|_{2}^{2}$ does not depend on $x$ and therefore does not affect the minimizer.
Following the explanations above, we finally obtain the following iterative algorithm.
$$
\begin{cases}
x^{(k+1)} \in \operatornamewithlimits{prox}_{\frac{1}{c}f} \left( x^{(k)} - \dfrac{1}{c}\nabla_{x}H(x^{(k)}, y^{(k)}) \right) \\
y^{(k+1)} \in \operatornamewithlimits{prox}_{\frac{1}{d}g} \left( y^{(k)} - \dfrac{1}{d}\nabla_{y}H(x^{(k+1)}, y^{(k)}) \right)
\end{cases}
$$
■
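To make the whole iteration concrete, here is a minimal sketch of PALM (our own illustration, not code from the paper) applied to the sparse factorization instance introduced earlier: $f(x) = \lambda_{1}\|x\|_{1}$, $g(y) = \lambda_{2}\|y\|_{1}$, $H(x, y) = \frac{1}{2}\|M - xy^{T}\|_{F}^{2}$. The step sizes follow the Lipschitz-based rule mentioned above, $c = \gamma \|y\|^{2}$ and $d = \gamma \|x\|^{2}$ with $\gamma > 1$; all function and variable names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def palm(M, lam1=0.1, lam2=0.1, gamma=1.1, iters=500, seed=0):
    """PALM for  min_{x,y} lam1*||x||_1 + lam2*||y||_1 + 0.5*||M - x y^T||_F^2.

    grad_x H = (x y^T - M) y   is Lipschitz in x with constant ||y||^2,
    grad_y H = (x y^T - M)^T x is Lipschitz in y with constant ||x||^2.
    """
    rng = np.random.default_rng(seed)
    n, m = M.shape
    x, y = rng.standard_normal(n), rng.standard_normal(m)
    for _ in range(iters):
        # x-update: gradient step on H, then prox of (1/c) f
        c = gamma * max(y @ y, 1e-8)                  # c = gamma * L1(y)
        grad_x = (np.outer(x, y) - M) @ y
        x = soft_threshold(x - grad_x / c, lam1 / c)
        # y-update uses the freshly updated x (the alternating order matters)
        d = gamma * max(x @ x, 1e-8)                  # d = gamma * L2(x)
        grad_y = (np.outer(x, y) - M).T @ x
        y = soft_threshold(y - grad_y / d, lam2 / d)
    return x, y

# Usage: recover a noisy rank-1 matrix
rng = np.random.default_rng(1)
M = np.outer(np.ones(20), np.linspace(1, 2, 30)) + 0.01 * rng.standard_normal((20, 30))
x, y = palm(M)
print(np.linalg.norm(M - np.outer(x, y)) / np.linalg.norm(M))  # relative residual
```

Since each prox here has a closed form, one PALM iteration costs only two matrix-vector products, which is the practical appeal of the method for problems of this shape.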