
Relative Entropy (Kullback-Leibler Divergence) between Two Normal Distributions

Formula

The relative entropy (KLD) between two normal distributions $N(\mu, \sigma^{2})$ and $N(\mu_{1}, \sigma_{1}^{2})$ is given by the following expression.

$$ D_{\text{KL}}\big( N(\mu, \sigma^{2}) \| N(\mu_{1}, \sigma_{1}^{2}) \big) = \log \left( \dfrac{\sigma_{1}}{\sigma} \right) + \dfrac{\sigma^{2} + (\mu - \mu_{1})^{2}}{2\sigma_{1}^{2}} - \dfrac{1}{2} $$

The relative entropy between two multivariate normal distributions $N(\boldsymbol{\mu}, \Sigma)$ and $N(\boldsymbol{\mu_{1}}, \Sigma_{1})$ is given by the following.

$$ D_{\text{KL}}\big( N(\boldsymbol{\mu}, \Sigma) \| N(\boldsymbol{\mu_{1}}, \Sigma_{1}) \big) = \dfrac{1}{2} \left[ \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu_{1}}) - k \right] $$

Here $k$ is the dimension of the distributions.
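For reference, both formulas translate directly into code. The following is a minimal NumPy sketch; the function names kl_normal and kl_mvn are chosen here only for illustration and do not come from any particular library.

```python
import numpy as np

def kl_normal(mu, var, mu1, var1):
    """D_KL( N(mu, var) || N(mu1, var1) ) between two univariate normals (variances as inputs)."""
    # log(sigma1 / sigma) = 0.5 * log(var1 / var)
    return 0.5 * np.log(var1 / var) + (var + (mu - mu1) ** 2) / (2.0 * var1) - 0.5

def kl_mvn(mu, Sigma, mu1, Sigma1):
    """D_KL( N(mu, Sigma) || N(mu1, Sigma1) ) between two k-variate normals."""
    k = len(mu)
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu - mu1
    return 0.5 * (np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma))
                  + np.trace(Sigma1_inv @ Sigma)
                  + diff @ Sigma1_inv @ diff
                  - k)
```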

Explanation

The relative entropy between two normal distributions can be obtained in closed form, as shown above. This is highly useful because it means relative entropy can be computed without any sampling in computer simulations. In particular, it offers significant advantages in machine learning and deep learning training tasks: compared to sampling-based estimates, the closed form gives stable, exact values and is efficient in both computation time and memory usage.
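For instance, the closed form can be compared against a naive Monte Carlo estimate of $\mathbb{E}_{p}[\log p - \log q]$. Below is a rough sketch using scipy.stats.norm; the parameter values, sample size, and seed are arbitrary.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0      # parameters of p
mu1, sigma1 = 1.0, 2.0    # parameters of q

# Closed-form value from the univariate formula above.
closed = np.log(sigma1 / sigma) + (sigma**2 + (mu - mu1) ** 2) / (2 * sigma1**2) - 0.5

# Naive Monte Carlo estimate of E_p[ log p(x) - log q(x) ].
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=100_000)
estimate = np.mean(norm.logpdf(x, mu, sigma) - norm.logpdf(x, mu1, sigma1))

print(closed, estimate)   # the sampled value fluctuates around the closed-form one
```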

Proof

Univariate

Normal Distribution: The probability density function of the normal distribution $N(\mu, \sigma^{2})$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^{2} > 0$ is given as follows.

$$ p(x) = \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right) $$

Let $p \sim N(\mu, \sigma^{2})$ and $q \sim N(\mu_{1}, \sigma_{1}^{2})$. The log probability density function of $p$ is as follows (and similarly for $q$).

$$ \log p = -\frac{1}{2} \log(2 \pi \sigma^{2}) - \frac{(x - \mu)^{2}}{2\sigma^{2}} $$

Thus, the KLD is computed as follows.

$$
\begin{align*}
D_{\text{KL}}(p \| q) &= \mathbb{E}_{p} \left[ -\frac{1}{2} \log(2 \pi \sigma^{2}) - \frac{(x - \mu)^{2}}{2\sigma^{2}} + \frac{1}{2} \log(2 \pi \sigma_{1}^{2}) + \frac{(x - \mu_{1})^{2}}{2\sigma_{1}^{2}} \right] \\
&= \dfrac{1}{2}\log \left( \dfrac{2 \pi \sigma_{1}^{2}}{2 \pi \sigma^{2}} \right) + \mathbb{E}_{p} \left[ -\frac{(x - \mu)^{2}}{2\sigma^{2}} + \frac{(x - \mu_{1})^{2}}{2\sigma_{1}^{2}} \right] \\
&= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) - \dfrac{1}{2\sigma^{2}}\mathbb{E}_{p} \left[ (x - \mu)^{2} \right] + \dfrac{1}{2\sigma_{1}^{2}}\mathbb{E}_{p} \left[ (x - \mu_{1})^{2} \right]
\end{align*}
$$

Here, the expected value in the second term is $\mathbb{E}_{p}[(x - \mu)^{2}] = \sigma^{2}$ due to the definition of variance. The third term is calculated as follows.

$$
\begin{align*}
\mathbb{E}_{p}[(x - \mu_{1})^{2}] &= \mathbb{E}_{p}[((x - \mu) + (\mu - \mu_{1}))^{2}] \\
&= \mathbb{E}_{p}[(x - \mu)^{2} + 2(\mu - \mu_{1})(x - \mu) + (\mu - \mu_{1})^{2}] \\
&= \mathbb{E}_{p}[(x - \mu)^{2}] + 2(\mu - \mu_{1})\mathbb{E}_{p}[x - \mu] + (\mu - \mu_{1})^{2} \\
&= \sigma^{2} + 0 + (\mu - \mu_{1})^{2}
\end{align*}
$$
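This identity is easy to sanity-check by simulation. Below is a small sketch with arbitrarily chosen $\mu$, $\sigma$, $\mu_{1}$.

```python
import numpy as np

mu, sigma, mu1 = 0.5, 1.5, -1.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)

print(np.mean((x - mu1) ** 2))        # Monte Carlo estimate of E_p[(x - mu1)^2]
print(sigma**2 + (mu - mu1) ** 2)     # sigma^2 + (mu - mu1)^2 = 2.25 + 2.25 = 4.5
```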

Therefore, we obtain the following.

$$
\begin{align*}
D_{\text{KL}}(p \| q) &= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) - \dfrac{1}{2\sigma^{2}} \sigma^{2} + \dfrac{1}{2\sigma_{1}^{2}} \left( \sigma^{2} + (\mu - \mu_{1})^{2} \right) \\
&= \log \left( \dfrac{\sigma_{1}}{\sigma} \right) + \dfrac{\sigma^{2} + (\mu - \mu_{1})^{2}}{2\sigma_{1}^{2}} - \dfrac{1}{2}
\end{align*}
$$
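The result can also be verified by numerically integrating $\int p(x) \log \frac{p(x)}{q(x)} \, dx$ directly. Below is a sketch using scipy.integrate.quad with arbitrary parameter values.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.0, 1.0
mu1, sigma1 = 2.0, 3.0

# Numerically integrate p(x) * log( p(x) / q(x) ) over the real line.
integrand = lambda x: norm.pdf(x, mu, sigma) * (norm.logpdf(x, mu, sigma) - norm.logpdf(x, mu1, sigma1))
numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed-form value derived above.
closed = np.log(sigma1 / sigma) + (sigma**2 + (mu - mu1) ** 2) / (2 * sigma1**2) - 0.5
print(numeric, closed)   # both are approximately 0.876
```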

Multivariate

Multivariate Normal Distribution: The probability density function of the multivariate normal distribution $N(\boldsymbol{\mu}, \Sigma)$ with mean vector $\boldsymbol{\mu} \in \mathbb{R}^{k}$ and covariance matrix $\Sigma \in \mathbb{R}^{k \times k}$ is as follows.

$$ p(\mathbf{x}) = \frac{1}{\sqrt{(2 \pi)^{k} |\Sigma|}} \exp \left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) $$

Let $p \sim N(\boldsymbol{\mu}, \Sigma)$ and $q \sim N(\boldsymbol{\mu_{1}}, \Sigma_{1})$. The log probability density function of $p$ is as follows (and similarly for $q$).

$$ \log p = -\frac{1}{2} \log((2 \pi)^{k} |\Sigma|) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) $$
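As a quick check, this expression for $\log p$ agrees with scipy.stats.multivariate_normal.logpdf; a small sketch with arbitrary parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])
k = len(mu)

# log p(x) written out exactly as in the expression above.
diff = x - mu
logp = (-0.5 * np.log((2 * np.pi) ** k * np.linalg.det(Sigma))
        - 0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(logp, multivariate_normal.logpdf(x, mean=mu, cov=Sigma))  # the two values should agree
```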

Hence, the KLD is computed as follows.

$$
\begin{align*}
D_{\text{KL}}(p \| q) &= \mathbb{E}_{p} \left[ -\frac{1}{2} \log((2 \pi)^{k} |\Sigma|) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right. \\
&\qquad\qquad + \left. \frac{1}{2} \log((2 \pi)^{k} |\Sigma_{1}|) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu_{1}}) \right] \\
&= \dfrac{1}{2} \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \mathbb{E}_{p} \left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu_{1}}) \right] \\
&= \dfrac{1}{2} \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) - \dfrac{1}{2} \mathbb{E}_{p} \left[ (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] + \dfrac{1}{2} \mathbb{E}_{p} \left[ (\mathbf{x} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu_{1}}) \right]
\end{align*}
$$

Expectation of Quadratic Form of Random Vector

For a random vector $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$, a symmetric matrix $A$, and any constant vector $\boldsymbol{\mu_{1}}$, the following hold.

$$
\begin{align*}
\mathbb{E}\left[ (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} A (\mathbf{x} - \boldsymbol{\mu}) \right] &= \operatorname{Tr}(A \Sigma) \\
\mathbb{E}\left[ (\mathbf{x} - \boldsymbol{\mu_{1}})^{\mathsf{T}} A (\mathbf{x} - \boldsymbol{\mu_{1}}) \right] &= \operatorname{Tr}(A \Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})^{\mathsf{T}} A (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})
\end{align*}
$$
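This lemma, too, can be verified by simulation. Below is a sketch with an arbitrary symmetric matrix $A$ and arbitrary parameters.

```python
import numpy as np

mu = np.array([1.0, 2.0])
mu1 = np.array([0.0, -1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # an arbitrary symmetric matrix

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=500_000)
d = x - mu1
mc = np.mean(np.einsum('ni,ij,nj->n', d, A, d))   # Monte Carlo estimate of E[(x - mu1)^T A (x - mu1)]

exact = np.trace(A @ Sigma) + (mu - mu1) @ A @ (mu - mu1)
print(mc, exact)   # the two values should be close
```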

Due to the above formula, and since the covariance matrix is symmetric, the expected value of the second term is $\operatorname{Tr}(\Sigma^{-1} \Sigma) = \operatorname{Tr}(I) = k$. The expected value of the third term is $\operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})$. Therefore, we obtain the following.

$$
\begin{align*}
D_{\text{KL}}(p \| q) &= \dfrac{1}{2} \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) - \dfrac{1}{2}k + \dfrac{1}{2} \left( \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu_{1}}) \right) \\
&= \dfrac{1}{2} \left[ \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu_{1}})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu_{1}}) - k \right]
\end{align*}
$$
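As a final consistency check, taking $k = 1$ with $\Sigma = \sigma^{2}$ and $\Sigma_{1} = \sigma_{1}^{2}$ reduces the multivariate expression to the univariate formula from the beginning. Below is a sketch with arbitrary values.

```python
import numpy as np

mu, sigma = 0.3, 1.2
mu1, sigma1 = -0.7, 2.5

# Univariate formula.
uni = np.log(sigma1 / sigma) + (sigma**2 + (mu - mu1) ** 2) / (2 * sigma1**2) - 0.5

# Multivariate formula with k = 1, Sigma = [[sigma^2]], Sigma1 = [[sigma1^2]].
S, S1 = np.array([[sigma**2]]), np.array([[sigma1**2]])
m, m1 = np.array([mu]), np.array([mu1])
S1_inv = np.linalg.inv(S1)
multi = 0.5 * (np.log(np.linalg.det(S1) / np.linalg.det(S))
               + np.trace(S1_inv @ S)
               + (m - m1) @ S1_inv @ (m - m1)
               - 1)

print(uni, float(multi))   # identical up to floating-point error
```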