Relative Entropy (Kullback-Leibler Divergence) between Two Normal Distributions
The relative entropy (Kullback-Leibler divergence, KLD) between two normal distributions $N(\mu, \sigma^{2})$ and $N(\mu_{1}, \sigma_{1}^{2})$ is given by the following expression.
$$
D_{\text{KL}}\big( N(\mu, \sigma^{2}) \| N(\mu_{1}, \sigma_{1}^{2}) \big)
= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) + \dfrac{\sigma^{2} + (\mu - \mu_{1})^{2}}{2\sigma_{1}^{2}} - \dfrac{1}{2}
$$
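This formula translates directly into code. Below is a minimal sketch in Python; the function name `kl_normal` is illustrative, not from any library, and the arguments are standard deviations rather than variances.

```python
import math

def kl_normal(mu: float, sigma: float, mu1: float, sigma1: float) -> float:
    """D_KL( N(mu, sigma^2) || N(mu1, sigma1^2) ), with standard deviations as inputs."""
    return (math.log(sigma1 / sigma)
            + (sigma**2 + (mu - mu1)**2) / (2 * sigma1**2)
            - 0.5)

# Sanity check: identical distributions have zero divergence
assert kl_normal(0.0, 1.0, 0.0, 1.0) == 0.0
```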
The relative entropy between two multivariate normal distributions $N(\boldsymbol{\mu}, \Sigma)$ and $N(\boldsymbol{\mu}_{1}, \Sigma_{1})$ on $\mathbb{R}^{k}$ is given by the following.
$$
\begin{aligned}
&D_{\text{KL}}\big( N(\boldsymbol{\mu}, \Sigma) \| N(\boldsymbol{\mu}_{1}, \Sigma_{1}) \big) \\
&\quad= \dfrac{1}{2} \left[ \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_{1}) - k \right]
\end{aligned}
$$
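Likewise for the multivariate case; a minimal NumPy sketch with the illustrative name `kl_mvn`. Using `slogdet` and `solve` avoids forming explicit determinants and inverses, which is numerically safer.

```python
import numpy as np

def kl_mvn(mu, Sigma, mu1, Sigma1):
    """D_KL( N(mu, Sigma) || N(mu1, Sigma1) ) for k-dimensional normal distributions."""
    k = len(mu)
    diff = mu - mu1
    # log(|Sigma1| / |Sigma|) via slogdet, which resists overflow/underflow
    log_ratio = np.linalg.slogdet(Sigma1)[1] - np.linalg.slogdet(Sigma)[1]
    return 0.5 * (log_ratio
                  + np.trace(np.linalg.solve(Sigma1, Sigma))   # Tr(Sigma1^{-1} Sigma)
                  + diff @ np.linalg.solve(Sigma1, diff)       # quadratic form
                  - k)
```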
Explanation

The relative entropy between two normal distributions admits the closed form shown above. This is highly useful: it means relative entropy can be computed in simulations without any sampling. That is a particular advantage in machine learning and deep learning, where the KLD appears in many training objectives. Compared to a sampling-based estimate, the closed form is exact and stable, and it is far more efficient in both computation time and memory.
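As a concrete illustration of this point, the sketch below compares the closed form against a naive Monte Carlo estimate of $\mathbb{E}_{p}[\log p(x) - \log q(x)]$; a minimal sketch assuming NumPy is available, with arbitrary parameter values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, mu1, sigma1 = 0.0, 1.0, 1.0, 2.0

# Exact closed form derived below
exact = np.log(sigma1 / sigma) + (sigma**2 + (mu - mu1)**2) / (2 * sigma1**2) - 0.5

# Naive Monte Carlo estimate: average of log p(x) - log q(x) over samples from p
x = rng.normal(mu, sigma, size=10_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
log_q = -0.5 * np.log(2 * np.pi * sigma1**2) - (x - mu1)**2 / (2 * sigma1**2)
estimate = (log_p - log_q).mean()

print(exact, estimate)  # the estimate fluctuates around the exact value
```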
Proof

Univariate Normal Distribution:
The probability density function of a normal distribution $N(\mu, \sigma^{2})$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^{2} > 0$ is given as follows.
$$
p(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)
$$
Let $p \sim N(\mu, \sigma^{2})$ and $q \sim N(\mu_{1}, \sigma_{1}^{2})$. The log probability density function is as follows.
$$
\log p = -\frac{1}{2} \log(2 \pi \sigma^{2}) - \frac{(x - \mu)^{2}}{2\sigma^{2}}
$$
Thus, the KLD is computed as follows.
$$
\begin{aligned}
D_{\text{KL}}(p \| q)
&= \mathbb{E}_{p} \left[ -\frac{1}{2} \log(2 \pi \sigma^{2}) - \frac{(x - \mu)^{2}}{2\sigma^{2}} + \frac{1}{2} \log(2 \pi \sigma_{1}^{2}) + \frac{(x - \mu_{1})^{2}}{2\sigma_{1}^{2}} \right] \\
&= \dfrac{1}{2}\log \left( \dfrac{2 \pi \sigma_{1}^{2}}{2 \pi \sigma^{2}} \right) + \mathbb{E}_{p} \left[ -\frac{(x - \mu)^{2}}{2\sigma^{2}} + \frac{(x - \mu_{1})^{2}}{2\sigma_{1}^{2}} \right] \\
&= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) - \dfrac{1}{2\sigma^{2}}\mathbb{E}_{p} \left[ (x - \mu)^{2} \right] + \dfrac{1}{2\sigma_{1}^{2}}\mathbb{E}_{p} \left[ (x - \mu_{1})^{2} \right]
\end{aligned}
$$
Here, the expected value in the second term is $\mathbb{E}_{p}[(x - \mu)^{2}] = \sigma^{2}$, by the definition of variance. The third term is calculated as follows.
$$
\begin{aligned}
\mathbb{E}_{p}[(x - \mu_{1})^{2}]
&= \mathbb{E}_{p}[((x - \mu) + (\mu - \mu_{1}))^{2}] \\
&= \mathbb{E}_{p}[(x - \mu)^{2} + 2(\mu - \mu_{1})(x - \mu) + (\mu - \mu_{1})^{2}] \\
&= \mathbb{E}_{p}[(x - \mu)^{2}] + 2(\mu - \mu_{1})\mathbb{E}_{p}[x - \mu] + (\mu - \mu_{1})^{2} \\
&= \sigma^{2} + 0 + (\mu - \mu_{1})^{2}
\end{aligned}
$$
Therefore, we obtain the following.
$$
\begin{aligned}
D_{\text{KL}}(p \| q)
&= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) - \dfrac{1}{2\sigma^{2}} \sigma^{2} + \dfrac{1}{2\sigma_{1}^{2}} \left( \sigma^{2} + (\mu - \mu_{1})^{2} \right) \\
&= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) + \dfrac{\sigma^{2} + (\mu - \mu_{1})^{2}}{2\sigma_{1}^{2}} - \dfrac{1}{2}
\end{aligned}
$$
■
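The derivation can be double-checked by integrating the definition $D_{\text{KL}}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$ numerically; a sketch assuming SciPy is available, with arbitrary parameter values.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma, mu1, sigma1 = 0.5, 1.0, -1.0, 1.5
p, q = norm(mu, sigma), norm(mu1, sigma1)

# Numerical integral of p(x) * (log p(x) - log q(x)) over the real line
integral, _ = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)),
                   -np.inf, np.inf)

closed = np.log(sigma1 / sigma) + (sigma**2 + (mu - mu1)**2) / (2 * sigma1**2) - 0.5
print(integral, closed)  # agree to quadrature precision
```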
Multivariate Normal Distribution:
The probability density function of a multivariate normal distribution $N(\boldsymbol{\mu}, \Sigma)$ with mean vector $\boldsymbol{\mu} \in \mathbb{R}^{k}$ and covariance matrix $\Sigma \in \mathbb{R}^{k \times k}$ is as follows.
$$
p(\mathbf{x}) = \frac{1}{\sqrt{(2 \pi)^{k} |\Sigma|}} \exp \left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
$$
Let $p \sim N(\boldsymbol{\mu}, \Sigma)$ and $q \sim N(\boldsymbol{\mu}_{1}, \Sigma_{1})$. The log probability density function is as follows.
$$
\log p = -\frac{1}{2} \log((2 \pi)^{k} |\Sigma|) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})
$$
Hence, the KLD is computed as follows.
$$
\begin{aligned}
D_{\text{KL}}(p \| q)
&= \mathbb{E}_{p} \left[ -\frac{1}{2} \log((2 \pi)^{k} |\Sigma|) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right. \\
&\qquad\qquad + \left. \frac{1}{2} \log((2 \pi)^{k} |\Sigma_{1}|) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{1}) \right] \\
&= \dfrac{1}{2} \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \mathbb{E}_{p} \left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{1}) \right] \\
&= \dfrac{1}{2} \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) - \dfrac{1}{2} \mathbb{E}_{p} \left[ (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] + \dfrac{1}{2} \mathbb{E}_{p} \left[ (\mathbf{x} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{1}) \right]
\end{aligned}
$$
Expectation of Quadratic Form of Random Vector
For a random vector $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$ and a symmetric matrix $A$, the following holds.
$$
\begin{aligned}
\mathbb{E}\left[ (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}}A (\mathbf{x} - \boldsymbol{\mu}) \right] &= \operatorname{Tr} (A \Sigma) \\
\mathbb{E}\left[ (\mathbf{x} - \boldsymbol{\mu}_{1})^{\mathsf{T}}A (\mathbf{x} - \boldsymbol{\mu}_{1}) \right] &= \operatorname{Tr} (A \Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} A (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})
\end{aligned}
$$
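As a quick numerical sanity check of this lemma (a sketch assuming NumPy; the matrices are random but fixed by the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
C = rng.normal(size=(k, k))
Sigma = C @ C.T + 0.1 * np.eye(k)   # random SPD covariance
M = rng.normal(size=(k, k))
A = (M + M.T) / 2                   # random symmetric matrix
mu, mu1 = rng.normal(size=k), rng.normal(size=k)

X = rng.multivariate_normal(mu, Sigma, size=500_000)
lhs = np.einsum('ni,ij,nj->n', X - mu1, A, X - mu1).mean()
rhs = np.trace(A @ Sigma) + (mu - mu1) @ A @ (mu - mu1)
print(lhs, rhs)  # agree up to Monte Carlo error
```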
By the lemma above, and since the covariance matrix is symmetric, the expected value in the second term is $\operatorname{Tr}(\Sigma^{-1} \Sigma) = \operatorname{Tr}(I) = k$. The expected value in the third term is $\operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})$. Therefore, we obtain the following.
$$
\begin{aligned}
D_{\text{KL}}(p \| q)
&= \dfrac{1}{2} \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) - \dfrac{1}{2}k + \dfrac{1}{2} \left( \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_{1}) \right) \\
&= \dfrac{1}{2} \left[ \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_{1}) - k \right]
\end{aligned}
$$
■
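Finally, the multivariate closed form can also be checked against a Monte Carlo estimate; a sketch assuming NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
k = 3
A = rng.normal(size=(k, k)); Sigma = A @ A.T + 0.1 * np.eye(k)
B = rng.normal(size=(k, k)); Sigma1 = B @ B.T + 0.1 * np.eye(k)
mu, mu1 = rng.normal(size=k), rng.normal(size=k)

# Closed form as derived above
d = mu - mu1
closed = 0.5 * (np.linalg.slogdet(Sigma1)[1] - np.linalg.slogdet(Sigma)[1]
                + np.trace(np.linalg.solve(Sigma1, Sigma))
                + d @ np.linalg.solve(Sigma1, d)
                - k)

# Monte Carlo estimate of E_p[log p(x) - log q(x)]
X = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = (multivariate_normal.logpdf(X, mu, Sigma)
      - multivariate_normal.logpdf(X, mu1, Sigma1)).mean()
print(closed, mc)  # should agree to a few decimal places
```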