Maximum a Posteriori Estimation for Linear Regression Model in Machine Learning
Theorem

Assume that the relationship between the data $\mathbf{x}_{i} \in \mathbb{R}^{n}$ and their labels $y_{i} \in \mathbb{R}$ can be expressed by the following linear model.
$$
y_{i} = \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i} + \epsilon_{i}, \qquad i = 1, \ldots, K \tag{1}
$$
The parameter $\mathbf{w}_{\text{MAP}}$ that maximizes the posterior probability is as follows. For $\mathbf{y} = \begin{bmatrix} y_{1} & \cdots & y_{K} \end{bmatrix}^{\mathsf{T}}$ and $\mathbf{X} = \begin{bmatrix} \mathbf{x}_{1} & \cdots & \mathbf{x}_{K} \end{bmatrix}^{\mathsf{T}} \in \mathbb{R}^{K \times n}$,
When the prior distribution is normal:
$$
\mathbf{w}_{\text{MAP}} = \left(\dfrac{1}{\sigma^{2}} \mathbf{X}^{T}\mathbf{X} + \boldsymbol{\Sigma}^{-1} \right)^{-1} \left(\dfrac{1}{\sigma^{2}} \mathbf{X}^{T} \mathbf{y} + \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \right)
$$
Here, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the mean vector and covariance matrix of the prior distribution of $\mathbf{w}$, respectively.
When the prior distribution is Laplace:
An explicit form of the optimal solution does not exist.
Explanation

Looking at $(3)$ below, if we assume that the prior distribution of $\mathbf{w}$ is the standard normal distribution $N(\mathbf{0}, \mathbf{I})$, it becomes clear that this is a problem similar to ridge regression.
$$
\argmin_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} + \dfrac{1}{2} \| \mathbf{w} \|_{2}^{2} \right]
$$
On the other hand, if we assume the prior distribution is the Laplace distribution $\operatorname{Laplace}(0, b)$, it becomes a problem similar to lasso regression.
$$
\argmin_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} + \dfrac{1}{b} \| \mathbf{w} \|_{1} \right]
$$
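As a quick sanity check of the ridge correspondence, here is a minimal NumPy sketch; the data sizes, noise level, and random seed are made up for illustration. With $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$, the closed form from the theorem reduces to $(\mathbf{X}^{T}\mathbf{X} + \sigma^{2}\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y}$, which is exactly the ridge regression solution with regularization strength $\sigma^{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 50, 5                       # toy sizes (assumed)
sigma = 0.5                        # assumed noise level
X = rng.normal(size=(K, n))        # rows are x_i^T, so X is K x n
w_true = rng.normal(size=n)
y = X @ w_true + rng.normal(scale=sigma, size=K)

# MAP with prior N(0, I): ((1/sigma^2) X^T X + I)^{-1} (1/sigma^2) X^T y
w_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(n), X.T @ y / sigma**2)

# ridge closed form with regularization strength sigma^2
w_ridge = np.linalg.solve(X.T @ X + sigma**2 * np.eye(n), X.T @ y)

print(np.allclose(w_map, w_ridge))  # True: the two estimates coincide
```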
Prior Distribution = Normal Distribution

In $(1)$, $\mathbf{w} \in \mathbb{R}^{n}$ is a parameter and $\epsilon_{i} \sim N(0, \sigma^{2})$ is Gaussian noise. Since $\epsilon_{i}$ follows $N(0, \sigma^{2})$, $y_{i} = \mathbf{w}^{\mathsf{T}} \mathbf{x}_{i} + \epsilon_{i}$ follows $N(\mathbf{w}^{\mathsf{T}} \mathbf{x}_{i}, \sigma^{2})$.
$$
y_{i} \sim N(\mathbf{w}^{\mathsf{T}} \mathbf{x}_{i}, \sigma^{2})
$$
Maximum a posteriori estimation is about finding the $\mathbf{w}_{\text{MAP}}$ that satisfies the following.
$$
\mathbf{w}_{\text{MAP}} = \argmax_{\mathbf{w}} p(\mathbf{y} | \mathbf{w}, \mathbf{X}) p(\mathbf{w}) \tag{2}
$$
$p(\mathbf{y} | \mathbf{w}, \mathbf{X})$ is the likelihood, and $p(\mathbf{w})$ is the prior probability. The likelihood function is as follows.
$$
p(\mathbf{y} | \mathbf{w}, \mathbf{X}) = \dfrac{1}{(2\pi \sigma^{2})^{K/2}} \exp \left[ -\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} \right]
$$
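For concreteness, this expression can be checked numerically. The sketch below, with made-up toy data, compares its logarithm against the sum of per-sample Gaussian log-densities computed with `scipy.stats.norm`.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
K, n = 20, 3                       # toy sizes (assumed)
sigma = 0.7
X = rng.normal(size=(K, n))
w = rng.normal(size=n)
y = X @ w + rng.normal(scale=sigma, size=K)

# logarithm of the likelihood expression above
log_lik = -K / 2 * np.log(2 * np.pi * sigma**2) - np.sum((y - X @ w) ** 2) / (2 * sigma**2)

# sum of per-sample log-densities of N(w^T x_i, sigma^2)
log_lik_ref = norm.logpdf(y, loc=X @ w, scale=sigma).sum()

print(np.isclose(log_lik, log_lik_ref))  # True
```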
Assume that the prior distribution of $\mathbf{w}$ is the following multivariate normal distribution.
$$
\mathbf{w} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad p(\mathbf{w}) = \dfrac{1}{\sqrt{(2\pi)^{n} \det \boldsymbol{\Sigma}}} \exp \left[ -\dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right]
$$
Since the posterior probability is expressed with exponential functions, it is more convenient to work with its logarithm.
$$
\begin{align*}
\mathbf{w}_{\text{MAP}}
&= \argmax_{\mathbf{w}} \log \big( p(\mathbf{y} | \mathbf{w}, \mathbf{X}) p(\mathbf{w}) \big) \\
&= \argmax_{\mathbf{w}} \left[-\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} -\dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right] \\
&= \argmin_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} + \dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right] \tag{3}
\end{align*}
$$
Thus, the $\mathbf{w}$ at which the gradient of the expression above equals $\mathbf{0}$ is $\mathbf{w}_{\text{MAP}}$. Let us compute the gradient.
$$
\begin{align*}
& \nabla_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} +\dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right] \\
&= \nabla_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} (\mathbf{y} - \mathbf{X}\mathbf{w})^{T}(\mathbf{y} - \mathbf{X}\mathbf{w}) +\dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right] \\
&= \nabla_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} (\mathbf{y} - \mathbf{X}\mathbf{w})^{T}(\mathbf{y} - \mathbf{X}\mathbf{w})\right] + \nabla_{\mathbf{w}} \left[\dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right]
\end{align*}
$$
Refer to the differentiation rules below.
Derivatives of Vectors and Matrices
Inner product
$$
\frac{ \partial f(\mathbf{x})}{ \partial \mathbf{x} } = \frac{ \partial (\mathbf{w}^{T}\mathbf{x})}{ \partial \mathbf{x} } = \frac{ \partial (\mathbf{x}^{T}\mathbf{w})}{ \partial \mathbf{x} } = \mathbf{w}
$$
Norm
$$
\nabla f(\mathbf{x}) = \dfrac{\partial \left\| \mathbf{x} \right\|^{2}}{\partial \mathbf{x}} = \dfrac{\partial (\mathbf{x}^{T}\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}
$$
Quadratic form
$$
\dfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \dfrac{\partial (\mathbf{x}^{T}\mathbf{R}\mathbf{x})}{\partial \mathbf{x}} = (\mathbf{R} + \mathbf{R}^{T})\mathbf{x}
$$
If $\mathbf{R}$ is a symmetric matrix,
$$
\dfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{R}\mathbf{x}
$$
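These identities are easy to verify numerically. Below is a small finite-difference check of the quadratic-form rule; the matrix $\mathbf{R}$ and the point $\mathbf{x}$ are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
R = rng.normal(size=(n, n))        # an arbitrary (non-symmetric) matrix
x = rng.normal(size=n)

f = lambda v: v @ R @ v            # f(x) = x^T R x
analytic = (R + R.T) @ x           # gradient according to the rule above

# central finite differences along each coordinate
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```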
Starting with the first term, the derivative is computed as follows.
$$
\begin{align*}
&\nabla_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} (\mathbf{y} - \mathbf{X}\mathbf{w})^{T}(\mathbf{y} - \mathbf{X}\mathbf{w})\right] \\
&=\nabla_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \left( \mathbf{y}^{T}\mathbf{y} - 2\mathbf{y}^{T}\mathbf{X}\mathbf{w} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right) \right] \\
&= \dfrac{1}{2\sigma^{2}}\left[- 2\mathbf{X}^{T}\mathbf{y} + 2\mathbf{X}^{T}\mathbf{X}\mathbf{w} \right] \\
&= \dfrac{1}{\sigma^{2}}\left[-\mathbf{X}^{T}\mathbf{y} + \mathbf{X}^{T}\mathbf{X}\mathbf{w} \right] \\
&= -\dfrac{1}{\sigma^{2}} \mathbf{X}^{T} \left(\mathbf{y} - \mathbf{X}\mathbf{w} \right)
\end{align*}
$$
Calculating the second term yields the following. Since the covariance matrix is symmetric, and the inverse of a symmetric matrix is also symmetric,
$$
\nabla_{\mathbf{w}} \left[\dfrac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{w} - \boldsymbol{\mu}) \right] = \boldsymbol{\Sigma}^{-1}(\mathbf{w} - \boldsymbol{\mu})
$$
Thus, we obtain the following.
$$
-\dfrac{1}{\sigma^{2}} \mathbf{X}^{T} \left(\mathbf{y} - \mathbf{X}\mathbf{w}_{\text{MAP}} \right) + \boldsymbol{\Sigma}^{-1}(\mathbf{w}_{\text{MAP}} - \boldsymbol{\mu}) = \mathbf{0}
$$
Solving the above equation for $\mathbf{w}_{\text{MAP}}$ yields:
$$
\begin{align*}
&& - \dfrac{1}{\sigma^{2}} \mathbf{X}^{T} \mathbf{y} + \dfrac{1}{\sigma^{2}} \mathbf{X}^{T}\mathbf{X}\mathbf{w}_{\text{MAP}} + \boldsymbol{\Sigma}^{-1} \mathbf{w}_{\text{MAP}} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} = \mathbf{0} \\
\implies && \left(\dfrac{1}{\sigma^{2}} \mathbf{X}^{T}\mathbf{X} + \boldsymbol{\Sigma}^{-1} \right) \mathbf{w}_{\text{MAP}} = \dfrac{1}{\sigma^{2}} \mathbf{X}^{T} \mathbf{y} + \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \\
\implies && \mathbf{w}_{\text{MAP}} = \left(\dfrac{1}{\sigma^{2}} \mathbf{X}^{T}\mathbf{X} + \boldsymbol{\Sigma}^{-1} \right)^{-1} \left(\dfrac{1}{\sigma^{2}} \mathbf{X}^{T} \mathbf{y} + \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \right)
\end{align*}
$$
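The result can be verified numerically. The following NumPy sketch, with a made-up data set and an arbitrary positive definite prior covariance, computes $\mathbf{w}_{\text{MAP}}$ from the closed form and confirms that the gradient of $(3)$ vanishes there.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 100, 3                      # toy sizes (assumed)
sigma = 0.3
X = rng.normal(size=(K, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=K)

# an arbitrary prior N(mu, Sigma) with Sigma symmetric positive definite
mu = np.array([0.5, 0.0, -0.5])
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)
Sigma_inv = np.linalg.inv(Sigma)

# closed-form MAP estimate
w_map = np.linalg.solve(X.T @ X / sigma**2 + Sigma_inv,
                        X.T @ y / sigma**2 + Sigma_inv @ mu)

# the gradient of (3) vanishes at w_map
grad = -X.T @ (y - X @ w_map) / sigma**2 + Sigma_inv @ (w_map - mu)
print(np.allclose(grad, 0))        # True
```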
Prior Distribution = Laplace Distribution

The assumptions and calculations are the same as above, except that the prior distribution is now the Laplace distribution. Maximum a posteriori estimation is about finding the $\mathbf{w}_{\text{MAP}}$ that satisfies the following.
$$
\mathbf{w}_{\text{MAP}} = \argmax_{\mathbf{w}} p(\mathbf{y} | \mathbf{w}, \mathbf{X}) p(\mathbf{w})
$$
$p(\mathbf{y} | \mathbf{w}, \mathbf{X})$ is the likelihood, and $p(\mathbf{w})$ is the prior probability. The likelihood function is as follows.
$$
p(\mathbf{y} | \mathbf{w}, \mathbf{X}) = \dfrac{1}{(2\pi \sigma^{2})^{K/2}} \exp \left[ -\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} \right]
$$
For the prior distribution of $\mathbf{w}$, assume that each $w_{i}$ independently follows the Laplace distribution $\operatorname{Laplace}(\mu, b)$.
$$
p(\mathbf{w}) = \prod_{i=1}^{n} \dfrac{1}{2b} \exp \left( -\dfrac{|w_{i} - \mu|}{b} \right)
$$

$$
\implies \log p(\mathbf{w}) = -n\log 2b - \sum_{i=1}^{n} \dfrac{|w_{i} - \mu|}{b}
$$
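As a quick check with arbitrary toy values, the log-prior above agrees with summing the log-densities from `scipy.stats.laplace`.

```python
import numpy as np
from scipy.stats import laplace

rng = np.random.default_rng(4)
n, mu, b = 6, 0.0, 0.5             # arbitrary toy values
w = rng.normal(size=n)

# logarithm of the prior expression above
log_prior = -n * np.log(2 * b) - np.sum(np.abs(w - mu)) / b
log_prior_ref = laplace.logpdf(w, loc=mu, scale=b).sum()

print(np.isclose(log_prior, log_prior_ref))  # True
```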
Since the posterior probability is expressed with exponential functions, it is more convenient to work with its logarithm.
$$
\begin{align*}
\mathbf{w}_{\text{MAP}}
&= \argmax_{\mathbf{w}} \log \big( p(\mathbf{y} | \mathbf{w}, \mathbf{X}) p(\mathbf{w}) \big) \\
&= \argmax_{\mathbf{w}} \left[-\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} - \sum_{i=1}^{n} \dfrac{|w_{i} - \mu|}{b} \right] \\
&= \argmin_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} + \dfrac{1}{b} \| \mathbf{w} - \mu\mathbf{1} \|_{1} \right]
\end{align*}
$$
Here, $\mathbf{1} \in \mathbb{R}^{n}$ is the vector whose elements are all 1. If we let $\mu = 0$, it becomes a form similar to lasso regression.
$$
\argmin_{\mathbf{w}} \left[\dfrac{1}{2\sigma^{2}} \| \mathbf{y} - \mathbf{X}\mathbf{w} \|_{2}^{2} + \dfrac{1}{b} \| \mathbf{w} \|_{1} \right]
$$
Unfortunately, in this case, it is known that there is no closed form for the optimal solution.
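Since no closed form exists, the minimizer has to be found numerically. The sketch below (assuming $\mu = 0$ and made-up toy data) uses proximal gradient descent (ISTA), handling the $\ell_{1}$ term with elementwise soft-thresholding; this is one standard choice among many, not a method prescribed by the derivation above.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def map_laplace_prior(X, y, sigma, b, n_iter=5000):
    """Minimize (1/(2 sigma^2)) ||y - Xw||_2^2 + (1/b) ||w||_1 by proximal
    gradient descent (ISTA); assumes the Laplace prior has mu = 0."""
    w = np.zeros(X.shape[1])
    # step size = 1 / Lipschitz constant of the smooth (quadratic) part
    step = sigma**2 / np.linalg.eigvalsh(X.T @ X).max()
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / sigma**2        # gradient of the quadratic term
        w = soft_threshold(w - step * grad, step / b)
    return w

# toy data (assumed sizes, noise level, and prior scale)
rng = np.random.default_rng(5)
K, n = 80, 10
w_true = np.zeros(n)
w_true[:3] = [2.0, -1.0, 0.5]                      # sparse ground truth
sigma, b = 0.2, 0.02
X = rng.normal(size=(K, n))
y = X @ w_true + rng.normal(scale=sigma, size=K)

w_map = map_laplace_prior(X, y, sigma, b)
print(np.round(w_map, 3))   # estimate of w_true; the L1 term shrinks small coefficients
```

In practice, dedicated lasso-type solvers (coordinate descent, FISTA, and the like) are typically used for problems of this form.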