
Paper Review: Denoising Diffusion Probabilistic Models (DDPM)

Overview and Summary

A generative model is a method for finding the probability distribution $Y$ that a given set of samples $\left\{ y_{i} \right\}$ follows. Since finding this distribution directly from scratch is highly challenging, it is common to approximate it by way of a well-known distribution. Thus, if a dataset $\left\{ x_{i} \right\}$ drawn from a well-known distribution $X$ is given, a generative model can be regarded as a function $f$, and developing a generative model amounts to finding a function that closely approximates $f$.

$$ f : \left\{ x_{i} \right\} \to \left\{ y_{j} \right\} $$

The most commonly used well-known distribution is the normal distribution. Therefore, one can loosely think of a generative model as a method for producing samples that follow an unknown distribution, starting from samples drawn from a normal distribution.

Denoising Diffusion Probabilistic Models (DDPM)1 likewise introduce a method for obtaining data that follow an unknown distribution starting from a normal distribution. Consider a noise vector $\mathbf{z}$ drawn from the normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and an image $\mathbf{x}_{0}$ assumed to be drawn from some unknown distribution $X$. Finding a function $\mathbf{x}_{0} = p(\mathbf{z})$ that produces $\mathbf{x}_{0}$ from $\mathbf{z}$ is difficult, but finding the reverse map $\mathbf{z} = q(\mathbf{x}_{0})$ that degrades $\mathbf{x}_{0}$ all the way down to $\mathbf{z}$ is not easy either. (Photo credits2)

However, it is easy to obtain a slightly degraded image $\mathbf{x}_{1} = \mathbf{x}_{0} + \boldsymbol{\epsilon}_{1}$ by adding a tiny Gaussian noise $\boldsymbol{\epsilon}_{1}$ to the image $\mathbf{x}_{0}$. Drawing another small Gaussian noise $\boldsymbol{\epsilon}_{2}$ and adding it to obtain $\mathbf{x}_{2} = \mathbf{x}_{1} + \boldsymbol{\epsilon}_{2}$ is just as easy. After repeating this process many times, $\mathbf{x}_{T}$ looks like noise drawn from a normal distribution. In other words, repeatedly adding Gaussian noise to the image $\mathbf{x}_{0}$ yields an image resembling a sample from a normal distribution; this process is called diffusion. The name comes from the fact that, mathematically, the random spreading of particles known as Brownian motion can be described by repeatedly adding Gaussian increments, and real-world Brownian motion is related to the diffusion (heat) equation.

Since diffusion repeatedly adds noise, its reverse repeatedly subtracts noise, which naturally leads to denoising as the inverse process. This paper introduces a method for carrying out the unknown denoising process by exploiting knowledge of the known diffusion process. The idea was first introduced as diffusion probabilistic models (DPM) in a 2015 paper3, Deep Unsupervised Learning using Nonequilibrium Thermodynamics, and DDPM complements it with an effective integration with deep learning.

The DDPM paper and its appendix are rather terse in their mathematical exposition, and many reviews of the paper fail to explain clearly how the formulas are derived. This article focuses meticulously on the mathematical derivations in DDPM rather than on its performance or results. Although the text is somewhat lengthy, no step in the derivations should pose an obstacle.

Not a single line of formula has been glossed over. This is arguably the most mathematically accurate and detailed explanation of the DDPM paper’s formulas worldwide.

Preliminaries

In this article, a random variable $\mathbf{x} \in \mathbb{R}^{D}$ and its realization are both denoted by $\mathbf{x} \in \mathbb{R}^{D}$. While the probability density function of a random variable $X$ is usually written $p_{X}(x)$, here it will be abbreviated as $p(\mathbf{x}) = p_{\mathbf{x}}(\mathbf{x})$.

A sequence of random variables $(\mathbf{x}_{0}, \mathbf{x}_{1}, \dots, \mathbf{x}_{T}) = \left\{ \mathbf{x}_{t} \right\}_{t=0}^{T}$ is called a stochastic process. In this article, such a process is denoted $\mathbf{x}_{0:T} = (\mathbf{x}_{0}, \mathbf{x}_{1}, \dots, \mathbf{x}_{T})$. The joint probability density function of $\mathbf{x}_{0:T}$ is written as follows:

$$ p(\mathbf{x}_{0:T}) = p_{\mathbf{x}_{0:T}}(\mathbf{x}_{0:T}) $$

Given $\mathbf{x}_{0:T-1}$, the conditional probability density function of $\mathbf{x}_{T}$, denoted $p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1})$, is defined by:

$$ p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}) = \dfrac{p(\mathbf{x}_{0:T})}{p(\mathbf{x}_{0:T-1})} $$

By repeating the above definition, we obtain the following.

$$ p(\mathbf{x}_{0:T}) = p(\mathbf{x}_{0}) p(\mathbf{x}_{1} | \mathbf{x}_{0}) p(\mathbf{x}_{2} | \mathbf{x}_{0:1}) \cdots p(\mathbf{x}_{T-1} | \mathbf{x}_{0:T-2}) p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}) \tag{1} $$

If $\mathbf{x}_{0:T}$ satisfies the following for every $t$, it is called a Markov chain:

$$ p(\mathbf{x}_{t} | \mathbf{x}_{0:t-1}) = p(\mathbf{x}_{t} | \mathbf{x}_{t-1}) $$

If $\mathbf{x}_{0:T}$ is a Markov chain, then $(1)$ simplifies as follows:

$$
\begin{align*}
p(\mathbf{x}_{0:T}) &= p(\mathbf{x}_{0}) p(\mathbf{x}_{1} | \mathbf{x}_{0}) p(\mathbf{x}_{2} | \mathbf{x}_{0:1}) \cdots p(\mathbf{x}_{T-1} | \mathbf{x}_{0:T-2}) p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}) \\
&= p(\mathbf{x}_{0}) p(\mathbf{x}_{1} | \mathbf{x}_{0}) p(\mathbf{x}_{2} | \mathbf{x}_{1}) \cdots p(\mathbf{x}_{T-1} | \mathbf{x}_{T-2}) p(\mathbf{x}_{T} | \mathbf{x}_{T-1}) \\
&= p(\mathbf{x}_{0}) \prod\limits_{t=1}^{T} p(\mathbf{x}_{t} | \mathbf{x}_{t-1}) \tag{2}
\end{align*}
$$
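
As a quick sanity check of $(2)$, the sketch below (my own toy example, not from the paper) builds a three-state Markov chain with a hypothetical transition matrix and confirms that the joint probability of a path is the product of the initial probability and the one-step transition probabilities.

```python
import numpy as np

# Toy 3-state Markov chain (hypothetical numbers, for illustration only).
p0 = np.array([0.5, 0.3, 0.2])            # distribution of x_0
P = np.array([[0.7, 0.2, 0.1],            # P[i, j] = p(x_t = j | x_{t-1} = i)
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

path = [0, 1, 1, 2]                        # a realization (x_0, x_1, x_2, x_3)

# Joint probability via the factorization p(x_0) * prod_t p(x_t | x_{t-1}).
joint = p0[path[0]]
for prev, cur in zip(path[:-1], path[1:]):
    joint *= P[prev, cur]
print(joint)                               # 0.5 * 0.2 * 0.8 * 0.1 = 0.008
```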

2 Background

Section 2 briefly introduces how the conceptual idea of the diffusion model can be mathematically expressed. Since the foundational idea comes from other papers, the explanation is not in-depth, making it difficult for beginners due to omitted details. I’ll elaborate on these as thoroughly as possible below.

  • Diffusion Process (Forward Process)

Let us first write the diffusion process mathematically. The image $\mathbf{x}_{t} \in \mathbb{R}^{D}$ at step $t$ is the image $\mathbf{x}_{t-1} \in \mathbb{R}^{D}$ from step $t-1$ with Gaussian noise $\sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \sim \mathcal{N}(\mathbf{0}, \beta_{t}\mathbf{I})$ added, where $\beta_{t}$ controls how much noise is injected at step $t$:

$$ \mathbf{x}_{t} = \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t}, \qquad \boldsymbol{\epsilon}_{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$

The reason the coefficients are $(\sqrt{1-\beta_{t}}, \sqrt{\beta_{t}})$ rather than $(1-\beta_{t}, \beta_{t})$ is the property of variance $\Var(aX) = a^{2}\Var(X)$: the square roots are needed so that $\mathbf{x}_{t}$ approximately follows a standard normal distribution once $t$ is sufficiently large. Indeed, suppose $t$ is large enough that $\mathbf{x}_{t-1}$ already follows a standard normal distribution. Then the covariance matrix of $\mathbf{x}_{t}$ is preserved as shown below, so $\mathbf{x}_{t}$ still follows a standard normal distribution. Since $\mathbf{x}_{t-1}$ and $\boldsymbol{\epsilon}_{t}$ are independent, the following holds:

$$
\begin{align*}
\Cov(\mathbf{x}_{t}) &= \Cov(\sqrt{1-\beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t}) \\
&= (1-\beta_{t})\Cov(\mathbf{x}_{t-1}) + \beta_{t}\Cov(\boldsymbol{\epsilon}_{t}) \\
&= (1-\beta_{t})\mathbf{I} + \beta_{t}\mathbf{I} \\
&= \mathbf{I}
\end{align*}
$$
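
The following sketch (my own illustration, not from the paper) checks this variance-preserving property numerically: starting from unit-variance samples, one noising step with coefficients $(\sqrt{1-\beta}, \sqrt{\beta})$ keeps the sample covariance close to the identity, while $(1-\beta, \beta)$ would shrink it.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02                                   # a typical per-step noise level
x_prev = rng.standard_normal((100_000, 2))    # pretend x_{t-1} ~ N(0, I)
eps = rng.standard_normal((100_000, 2))

x_t = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * eps
print(np.cov(x_t, rowvar=False))              # ~ identity matrix

# With coefficients (1 - beta, beta) instead, the variance would shrink:
x_bad = (1 - beta) * x_prev + beta * eps
print(np.cov(x_bad, rowvar=False))            # ~ (1-beta)^2 + beta^2 < 1 on the diagonal
```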

When $\mathbf{x}_{t-1}$ is fixed (treated as a constant), the conditional probability density function of $\mathbf{x}_{t}$ is as follows:

$$ q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_{t}}\mathbf{x}_{t-1}, \beta_{t}\mathbf{I}) \tag{3} $$

(The paper uses the notation $\mathcal{N}(\mathbf{x}_{t}; \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1}, \beta_{t}\mathbf{I})$ to make explicit that this is a distribution over $\mathbf{x}_{t}$.) The computations below verify that the mean vector and covariance matrix are as stated; assume $\mathbf{x}_{t-1}$ is fixed (i.e., constant).

$$
\begin{align*}
\mathbb{E} [\mathbf{x}_{t} | \mathbf{x}_{t-1}] &= \mathbb{E} \left[ \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right] \qquad (\mathbf{x}_{t-1} \text{ is a constant vector}) \\
&= \mathbb{E} \left[ \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} \right] + \mathbb{E} \left[ \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right] \\
&= \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\mathbb{E} \left[ \boldsymbol{\epsilon}_{t} \right] \\
&= \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1}
\end{align*}
$$

$$
\begin{align*}
\Cov (\mathbf{x}_{t} | \mathbf{x}_{t-1}) &= \Cov \left( \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right) \qquad (\mathbf{x}_{t-1} \text{ is a constant vector}) \\
&= \Cov \left( \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right) \\
&= \beta_{t}\Cov \left( \boldsymbol{\epsilon}_{t} \right) \\
&= \beta_{t} \mathbf{I}
\end{align*}
$$

Conditioned on a given image $\mathbf{x}_{0}$, repeatedly adding Gaussian noise produces the degraded images $\mathbf{x}_{1}, \mathbf{x}_{2}, \dots, \mathbf{x}_{T}$ (note that the same notation is used for these random variables). The density of this entire trajectory is the conditional joint probability density function $q(\mathbf{x}_{1:T} | \mathbf{x}_{0})$, referred to as the forward process or diffusion process. Since $\mathbf{x}_{0:T}$ is a Markov chain, by $(2)$ the diffusion process can be modeled as follows:

$$ q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) = \prod\limits_{t=1}^{T} q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) \tag{4} $$

Meanwhile, the conditional probability density function of $\mathbf{x}_{t}$ at an arbitrary step $t$ given $\mathbf{x}_{0}$ can be written in closed form. Let $\alpha_{t} = 1 - \beta_{t}$; then:

$$
\begin{align*}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t} \\
&= \sqrt{\alpha_{t}}\left( \sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-1} \right) + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t} \\
&= \sqrt{\alpha_{t}}\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{\alpha_{t}}\sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-1} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t}
\end{align*}
$$

The sum of two independent normal random variables is again normal:

If $X_{1} \sim \mathcal{N}(\mu_{1}, \sigma_{1}^{2})$ and $X_{2} \sim \mathcal{N}(\mu_{2}, \sigma_{2}^{2})$ are independent, then $X_{1} + X_{2} \sim \mathcal{N}(\mu_{1} + \mu_{2}, \sigma_{1}^{2} + \sigma_{2}^{2})$. The same holds for random vectors.

Therefore, the following holds:

$$
\begin{align*}
\sqrt{\alpha_{t}}\sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-1} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t} &\sim \mathcal{N}\big(\mathbf{0},\ \alpha_{t}(1-\alpha_{t-1})\mathbf{I} + (1 - \alpha_{t})\mathbf{I}\big) \\
&= \mathcal{N}\big(\mathbf{0},\ (1 - \alpha_{t}\alpha_{t-1})\mathbf{I}\big)
\end{align*}
$$

Thus, $\mathbf{x}_{t}$ is given by:

$$ \mathbf{x}_{t} = \sqrt{\alpha_{t}\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$

Repeating this process yields the following:

$$
\begin{align*}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}}\left(\sqrt{\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{1 - \alpha_{t-2}}\boldsymbol{\epsilon}_{t-2}\right) + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{\alpha_{t}\alpha_{t-1}(1 - \alpha_{t-2})}\boldsymbol{\epsilon}_{t-2} + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{1 - \alpha_{t}\alpha_{t-1}\alpha_{t-2}}\boldsymbol{\epsilon} \\
&\ \ \vdots \\
&= \sqrt{\prod\limits_{s=1}^{t} \alpha_{s}}\,\mathbf{x}_{0} + \sqrt{1 - \prod\limits_{s=1}^{t} \alpha_{s}}\,\boldsymbol{\epsilon} \\
&= \sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1 - \overline{\alpha}_{t}}\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad \overline{\alpha}_{t} = \prod\limits_{s=1}^{t} \alpha_{s} \tag{5}
\end{align*}
$$
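
The sketch below (my own check, using a hypothetical constant $\beta_{t}$ schedule) compares noising a sample step by step against the closed-form expression $(5)$; both produce the same mean and variance statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.full(T, 0.02)                       # hypothetical constant schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                 # \bar{alpha}_t

x0 = rng.standard_normal(100_000) * 0.1 + 3.0  # some non-standard "data"

# Iterative forward process: x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) eps_t
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(1 - alphas[t]) * rng.standard_normal(x0.shape)

# Closed form (5): x_T = sqrt(alpha_bar_T) x_0 + sqrt(1 - alpha_bar_T) eps
x_closed = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(x0.shape)

print(x.mean(), x.var())                       # both close to (sqrt(alpha_bar_T) * 3, ~1)
print(x_closed.mean(), x_closed.var())
```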

Hence, given $\mathbf{x}_{0}$, the conditional density of $\mathbf{x}_{t}$ is given as follows:

$$ q(\mathbf{x}_{t} | \mathbf{x}_{0}) = \mathcal{N}(\sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0}, (1 - \overline{\alpha}_{t})\mathbf{I}) \tag{6} $$

Since each $\alpha_{t}$ is strictly smaller than $1$ (i.e., $\beta_{t} > 0$), for sufficiently large $T$ the product $\overline{\alpha}_{T} = \prod_{s=1}^{T} \alpha_{s}$ is close to $0$, and thus the following holds:

$$ q(\mathbf{x}_{T} | \mathbf{x}_{0}) \to \mathcal{N}(\mathbf{0}, \mathbf{I}) \quad \text{as } T \to \infty $$

These equations support the idea that $\mathbf{x}_{T}$, obtained by repeatedly adding Gaussian noise to $\mathbf{x}_{0}$, follows a standard normal distribution.
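
For a concrete feel, the sketch below evaluates $\overline{\alpha}_{T}$ for a linear schedule (the commonly used setting $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$ with $T=1000$; treat these numbers as an assumption here) and shows that it is vanishingly small, so $q(\mathbf{x}_{T}|\mathbf{x}_{0})$ is essentially $\mathcal{N}(\mathbf{0},\mathbf{I})$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (assumed here)
alphas = 1.0 - betas
alpha_bar_T = np.prod(alphas)

print(alpha_bar_T)                   # ~ 4e-5: q(x_T | x_0) = N(sqrt(4e-5) x_0, (1 - 4e-5) I)
print(np.sqrt(alpha_bar_T))          # the residual weight on x_0 is only ~ 0.006
```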

  • Denoising Process (Reverse Process)

Now consider the reverse process (denoising), which recovers $\mathbf{x}_{0}$ by gradually removing noise starting from Gaussian noise $\mathbf{x}_{T} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. According to Feller4, for distributions such as the normal or binomial, the reversal of the diffusion process has the same functional form as the forward process, provided the diffusion rate $\beta$ is sufficiently small.

“For both Gaussian and binomial diffusion, for continuous diffusion (limit of small step size $\beta$) the reversal of the diffusion process has the identical functional form as the forward process.”

– Sohl-Dickstein, Jascha, et al. Deep unsupervised learning using nonequilibrium thermodynamics. International conference on machine learning. PMLR, 2015.

This implies that the probability density function of the denoising process, denoted $p$, satisfies the following: since $q(\mathbf{x}_{t} | \mathbf{x}_{t-1})$ is a normal distribution, its reversal is as well,

$$ p(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}_{t}, t), \boldsymbol{\Sigma}(\mathbf{x}_{t}, t)) $$

The reverse process is also a Markov chain, so the entire denoising trajectory, starting from standard normal noise $\mathbf{x}_{T}$ and ending at a specific image $\mathbf{x}_{0}$, can be written as follows:

$$ p(\mathbf{x}_{0:T}) = p(\mathbf{x}_{T}) \prod\limits_{t=1}^{T} p(\mathbf{x}_{t-1} | \mathbf{x}_{t}) $$

Unlike the diffusion process $q$, the denoising process $p(\mathbf{x}_{0:T})$ is unknown and must be estimated. Denote by $p_{\theta}$ the approximation of $p$ with parameters $\theta$. Since the density of $\mathbf{x}_{T}$ is known to be $p(\mathbf{x}_{T}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have:

$$ p_{\theta} (\mathbf{x}_{0:T}) = p(\mathbf{x}_{T}) \prod\limits_{t=1}^{T} p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_{t}) \tag{7} $$

While each $p(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ is known to be normal, its mean vector and covariance matrix are unknown and must also be estimated. Consequently, $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ is parameterized as follows:

$$ p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t), \boldsymbol{\Sigma}_{\theta}(\mathbf{x}_{t}, t)) $$

  • Loss Function

To generate new data, we want the probability distribution $p(\mathbf{x}_{0})$ of the data $\mathbf{x}_{0}$: knowing it would allow new samples to be drawn from the same distribution as the existing data. Let $p_{\theta}(\mathbf{x}_{0})$, with parameters $\theta$, be the approximation of $p(\mathbf{x}_{0})$. To make $p_{\theta}(\mathbf{x}_{0})$ close to $p(\mathbf{x}_{0})$, we treat $p_{\theta}(\mathbf{x}_{0})$ as a likelihood and use maximum likelihood estimation. Since the distributions involved are assumed normal, working with the log-likelihood simplifies the mathematics and connects naturally with relative entropy (Kullback-Leibler divergence). Maximizing $\log p_{\theta}(\mathbf{x}_{0})$ is equivalent to minimizing $-\log p_{\theta}(\mathbf{x}_{0})$, which is our starting point (gradient descent will be used). Hence the goal is to find $\theta$ that minimizes $\mathbb{E}_{q(\mathbf{x}_{0})}[-\log p_{\theta}(\mathbf{x}_{0})]$. By the definitions of marginal and conditional probability density functions, $p_{\theta}(\mathbf{x}_{0})$ can be expressed as follows:

$$
\begin{align*}
p_{\theta}(\mathbf{x}_{0}) &= \int p_{\theta}(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \\
&= \int p_{\theta}(\mathbf{x}_{0:T}) \dfrac{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} d\mathbf{x}_{1:T} \\
&= \int q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} d\mathbf{x}_{1:T} \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \left[ \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right]
\end{align*}
$$

We can break down the expectation we intend to minimize as follows. By Jensen's inequality, $-\log (\mathbb{E}[X]) \le \mathbb{E}[-\log X]$, which implies:

$$
\begin{align*}
\mathbb{E}_{q(\mathbf{x}_{0})}[-\log p_{\theta}(\mathbf{x}_{0})] &= \mathbb{E}_{q(\mathbf{x}_{0})} \left[ -\log \left( \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \left[ \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right] \right) \right] \\
&\le \mathbb{E}_{q(\mathbf{x}_{0})} \left[ \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})}\left( -\log \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right]
\end{align*}
$$

The final equality holds by the definition of conditional probability density functions, $q(X,Y) = q(X)q(Y|X)$; writing the two expectations as integrals makes this clear. Substituting $(4)$ and $(7)$ into the equation yields:

$$
\begin{align*}
\mathbb{E}_{q(\mathbf{x}_{0})}[-\log p_{\theta}(\mathbf{x}_{0})] &\le \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log \dfrac{ p(\mathbf{x}_{T}) \prod\limits_{t=1}^{T} p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\prod\limits_{t=1}^{T} q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log p(\mathbf{x}_{T}) - \sum\limits_{t=1}^{T} \log \dfrac{ p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_{t})}{ q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} \right] =: L
\end{align*}
$$

The last equality follows from the properties of logarithms. The left-hand side is our actual objective, but it cannot be computed because the true data distribution $q(\mathbf{x}_{0})$ is unknown. The right-hand side, however, consists of computable quantities: $p(\mathbf{x}_{T})$ is a standard normal density by assumption, and $q(\mathbf{x}_{t} | \mathbf{x}_{t-1})$ is obtained by actually running the diffusion process. Therefore, by minimizing the computable right-hand side, the inequality guarantees that the left-hand side is indirectly minimized as well.

Through algebraic manipulation, the loss can be rewritten in a form better suited for computation. As it stands, the term $\log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t} | \mathbf{x}_{t-1})}$ in $L$ must be estimated by sampling from $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})$, and the resulting Monte Carlo estimates have high variance, which destabilizes training. We therefore rewrite the loss in terms of Kullback-Leibler divergences (KLD). Since the distributions being compared are normal, each KLD has the closed form given below, so the loss can be computed exactly without this sampling variance, yielding more stable and efficient training.

Relative Entropy of Normal Distributions:

The relative entropy between two multivariate normal distributions $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ and $\mathcal{N}(\boldsymbol{\mu}_{1}, \Sigma_{1})$, with $\boldsymbol{\mu}, \boldsymbol{\mu}_{1} \in \mathbb{R}^{D}$, is given by:

$$ D_{\text{KL}}\big( \mathcal{N}(\boldsymbol{\mu}, \Sigma) \,\|\, \mathcal{N}(\boldsymbol{\mu}_{1}, \Sigma_{1}) \big) = \dfrac{1}{2} \left[ \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \Tr(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_{1}) - D \right] $$
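
The sketch below (my own helper, not from the paper) implements this closed form for the isotropic case $\Sigma = s\mathbf{I}$, $\Sigma_{1} = s_{1}\mathbf{I}$, which is the only case the DDPM derivation needs.

```python
import numpy as np

def kl_isotropic_gaussians(mu, s, mu1, s1):
    """KL( N(mu, s*I) || N(mu1, s1*I) ) for isotropic covariances s*I and s1*I."""
    d = mu.shape[-1]
    log_det_ratio = d * (np.log(s1) - np.log(s))        # log(|Sigma1| / |Sigma|)
    trace_term = d * s / s1                              # Tr(Sigma1^{-1} Sigma)
    quad_term = np.sum((mu - mu1) ** 2, axis=-1) / s1    # (mu-mu1)^T Sigma1^{-1} (mu-mu1)
    return 0.5 * (log_det_ratio + trace_term + quad_term - d)

# Sanity check: the KL divergence of a distribution with itself is zero.
mu = np.array([0.3, -1.2, 0.7])
print(kl_isotropic_gaussians(mu, 0.5, mu, 0.5))   # 0.0
```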

$$
\begin{align*}
L &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=1}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right]
\end{align*}
$$

For $t \ge 2$, the denominator of the summand can be rewritten as follows:

$$
\begin{align*}
q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) &= q(\mathbf{x}_{t} | \mathbf{x}_{t-1}, \mathbf{x}_{0}) &&\text{by Markov property} \\
&= \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{t-1}, \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0})} &&\text{by def. of conditional pdf} \\
&= \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t}, \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0})} &&\text{by def. of conditional pdf} \\
&= q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0})} \\
&= q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{0}) / q(\mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0}) / q(\mathbf{x}_{0})} \\
&= q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \dfrac{q(\mathbf{x}_{t}| \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}| \mathbf{x}_{0})} &&\text{by def. of conditional pdf}
\end{align*}
$$

Substituting gives the formulation:

$$
\begin{align*}
L &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \left( \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}\dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \right) - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})} - \sum\limits_{t=2}^{T} \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})} - \log \left( \dfrac{q(\mathbf{x}_{T-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{T} | \mathbf{x}_{0})} \cdot \dfrac{q(\mathbf{x}_{T-2} | \mathbf{x}_{0})}{q(\mathbf{x}_{T-1} | \mathbf{x}_{0})} \cdots \dfrac{q(\mathbf{x}_{1} | \mathbf{x}_{0})}{q(\mathbf{x}_{2} | \mathbf{x}_{0})} \cdot \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})} - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{T} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} + \sum\limits_{t=2}^{T} \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} - \log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} \right] + \sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} \right] - \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) \right] \tag{8}
\end{align*}
$$

In the fourth equality the telescoping product cancels term by term, leaving only $p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})$ over $q(\mathbf{x}_{T} | \mathbf{x}_{0})$ inside the logarithm.

The first term of $(8)$ can be rewritten as a KLD, as shown below.

$$
\begin{align*}
\mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} \right] &= \int q(\mathbf{x}_{0:T}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{0:T} \\
&= \int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{0:T} &&\text{by def. of conditional pdf} \\
&= \int\int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{1:T-1} d\mathbf{x}_{T} d\mathbf{x}_{0} \\
&= \int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{T} d\mathbf{x}_{0} &&\text{by def. of marginal pdf} \\
&= \mathbb{E}_{q(\mathbf{x}_{0})}\left[ \int q(\mathbf{x}_{T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{T} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right) \Big]
\end{align*}
$$

Note that the paper writes this loosely as $\mathbb{E}_{q(\mathbf{x}_{0:T})} = \mathbb{E}_{q} = \mathbb{E}_{q(\mathbf{x}_{0})}$. Because the quantity inside the expectation depends only on $\mathbf{x}_{0}$, the result is the same even when the expectation is nominally taken over extra variables; since the paper never explains this, the reader has to watch out for it.

Now consider the second term in $(8)$. As with the first term, we rearrange $q(\mathbf{x}_{0:T})$ so that a KLD involving $q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})$ appears.

$$
\begin{align*}
\sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} \right] &= \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{0:T}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{0:T} \\
&= \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{0:T} \\
&= \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:t-1}, \mathbf{x}_{t+1:T} | \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{0:T} \\
&= \sum\limits_{t=2}^{T} \int\int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:t-1}, \mathbf{x}_{t+1:T} | \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{(1:t-2,\, t+1:T)} d\mathbf{x}_{t-1:t} d\mathbf{x}_{0} \\
&= \sum\limits_{t=2}^{T} \int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{t-1}| \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{t-1:t} d\mathbf{x}_{0} \\
&= \sum\limits_{t=2}^{T} \int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) d\mathbf{x}_{t} d\mathbf{x}_{0} \\
&= \int q(\mathbf{x}_{0}) \left[ \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{t} | \mathbf{x}_{0}) D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) d\mathbf{x}_{t} \right] d\mathbf{x}_{0} \\
&= \mathbb{E}_{q(\mathbf{x}_{0})} \left[ \sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \Big] \right]
\end{align*}
$$

Once again, beware of the notational shortcut in the paper, which writes this as $\mathbb{E}_{q} \left[ \sum_{t=2}^{T} D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \| p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \right]$: the extra integrals over variables that do not appear inside the expectation change nothing, which is exactly what allows the paper's compact notation.

Finally, the whole of $(8)$ can be rewritten in terms of KLDs:

$$
\begin{align*}
L &= \mathbb{E}_{q(\mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right) \Big] \\
&\qquad + \mathbb{E}_{q(\mathbf{x}_{0})} \left[ \sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \Big] \right] - \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) \right]
\end{align*}
$$

With the same kind of shortcut on the expectations, the paper writes this compactly as:

$$ \mathbb{E}_{q(\mathbf{x}_{0:T})} \bigg[ \underbrace{D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right)}_{L_{T}} + \sum\limits_{t=2}^{T} \underbrace{D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right)}_{L_{t-1}} - \underbrace{\log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}_{L_{0}} \bigg] $$

The conditional density $q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})$ appearing above can be computed explicitly as follows.

Properties of Conditional Probability

$$ p(\mathbf{x} | \mathbf{y}, \mathbf{z}) = \dfrac{p(\mathbf{x}, \mathbf{y} | \mathbf{z}) }{p(\mathbf{y} | \mathbf{z})} $$

$$
\begin{align*}
q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) &= \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \\
&= \dfrac{q(\mathbf{x}_{t} | \mathbf{x}_{t-1}, \mathbf{x}_{0}) q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \\
&= \dfrac{q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})}
\end{align*}
$$

The final equality holds by the Markov property of $\left\{ \mathbf{x}_{t} \right\}$. Substituting the probability density functions $(3)$ and $(6)$ computed earlier gives:

$$
\begin{align*}
q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) &= \dfrac{q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \\
&= C \dfrac{\exp\left( - \dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t} - \sqrt{\alpha_{t}}\mathbf{x}_{t-1} \right)^{2}}{1-\alpha_{t}} \right) \exp\left( - \dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t-1} - \sqrt{\overline{\alpha}_{t-1}}\mathbf{x}_{0} \right)^{2}}{1-\overline{\alpha}_{t-1}} \right)}{\exp\left( - \dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t} - \sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} \right)^{2}}{1-\overline{\alpha}_{t}} \right)} \\
&= C \exp\left[ -\dfrac{1}{2} \left( \dfrac{1}{1-\alpha_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} - \dfrac{2\sqrt{\alpha_{t}}}{1-\alpha_{t}} \mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t-1} + \dfrac{\alpha_{t}}{1-\alpha_{t}} \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} \right.\right.\\
&\qquad\qquad\qquad\quad + \dfrac{1}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - \dfrac{2\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{t-1} + \dfrac{\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0} \\
&\qquad\qquad\qquad\quad \left.\left. - \dfrac{1}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + \dfrac{2\sqrt{\overline{\alpha}_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} - \dfrac{\overline{\alpha}_{t}}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0} \right)\right] \tag{9}
\end{align*}
$$

Here, $C = \dfrac{1}{\sqrt{ \left(2\pi \dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} \right)^{D} }}$ is a constant with respect to $\mathbf{x}_{t-1}$. Since we are after $q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})$, collect the terms of the exponent involving $\mathbf{x}_{t-1}$:

$$
\begin{align*}
&\left( \dfrac{\alpha_{t}}{1-\alpha_{t}} + \dfrac{1}{1-\overline{\alpha}_{t-1}} \right) \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2\left( \dfrac{\sqrt{\alpha_{t}}}{1-\alpha_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \\
&\qquad + \left[ \left( \dfrac{1}{1-\alpha_{t}} - \dfrac{1}{1-\overline{\alpha}_{t}} \right)\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{\sqrt{\overline{\alpha}_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \left( \dfrac{\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t-1}} - \dfrac{\overline{\alpha}_{t}}{1-\overline{\alpha}_{t}} \right)\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right]
\end{align*}
$$

Combine constants from each term through common denominators:

$$ \dfrac{\alpha_{t}}{1-\alpha_{t}} + \dfrac{1}{1-\overline{\alpha}_{t-1}} = \dfrac{\alpha_{t} - \alpha_{t}\overline{\alpha}_{t-1} + 1 - \alpha_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} = \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} $$

$$ \dfrac{1}{1-\alpha_{t}} - \dfrac{1}{1-\overline{\alpha}_{t}} = \dfrac{\alpha_{t} - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t})} = \dfrac{\alpha_{t}(1-\overline{\alpha}_{t-1})}{(1-\alpha_{t})(1-\overline{\alpha}_{t})} $$

$$ \dfrac{\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t-1}} - \dfrac{\overline{\alpha}_{t}}{1-\overline{\alpha}_{t}} = \dfrac{\overline{\alpha}_{t-1} - \overline{\alpha}_{t}}{(1-\overline{\alpha}_{t-1})(1-\overline{\alpha}_{t})} = \dfrac{\overline{\alpha}_{t-1}(1 - \alpha_{t})}{(1-\overline{\alpha}_{t-1})(1-\overline{\alpha}_{t})} $$

Substituting these back in gives:

$$
\begin{align*}
& \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2\left( \dfrac{\sqrt{\alpha_{t}}}{1-\alpha_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \\
&\qquad + \left[ \dfrac{\alpha_{t}(1-\overline{\alpha}_{t-1})}{(1-\alpha_{t})(1-\overline{\alpha}_{t})}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{\sqrt{\overline{\alpha}_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \dfrac{\overline{\alpha}_{t-1}(1 - \alpha_{t})}{(1-\overline{\alpha}_{t-1})(1-\overline{\alpha}_{t})}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right]
\end{align*}
$$

Factoring out the coefficient of $\mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1}$ gives:

$$
\begin{align*}
& \frac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \left( \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2 \left( \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})}{1 - \overline{\alpha}_{t}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \right. \\
&\qquad +\left. \left[ \frac{\alpha_{t}(1-\overline{\alpha}_{t-1})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})\sqrt{\overline{\alpha}_{t}}}{(1 - \overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \frac{\overline{\alpha}_{t-1}(1 - \alpha_{t})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right] \right) \tag{10}
\end{align*}
$$

The bracketed term is in fact a perfect square:

$$
\begin{align*}
&\left[ \dfrac{\alpha_{t}(1-\overline{\alpha}_{t-1})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})\sqrt{\alpha_{t}\overline{\alpha}_{t-1}}}{(1 - \overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \dfrac{\overline{\alpha}_{t-1}(1 - \alpha_{t})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right] \\
&=\left[ \dfrac{\sqrt{\alpha_{t}}^{2}(1-\overline{\alpha}_{t-1})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})\sqrt{\alpha_{t}\overline{\alpha}_{t-1}}}{(1 - \overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}^{2}(1 - \alpha_{t})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right] \\
&=\left[ \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right]^{2}
\end{align*}
$$

Plugging this back into $(10)$ gives:

$$
\begin{align*}
& \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \left( \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2 \left( \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})}{1 - \overline{\alpha}_{t}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \right. \\
&\qquad \left. +\left[ \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right]^{2} \right) \\
&= \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \left( \mathbf{x}_{t-1} - \left[ \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right] \right)^{2}
\end{align*}
$$

Putting this back into $(9)$ gives:

$$ q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) = \frac{1}{\sqrt{ \left(2\pi \frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} \right)^{D} }} \exp\left[ -\dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t-1} - \left[ \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right] \right)^{2}}{\frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}}} \right] $$

This is exactly the density of a normal distribution:

$$
\begin{align*}
q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) &= \mathcal{N} \left( \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0},\ \dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}}\mathbf{I} \right) \\
&= \mathcal{N} ( \tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}), \tilde{\beta}_{t} \mathbf{I})
\end{align*}
$$

$$ \text{where}\qquad \tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) = \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}, $$

$$ \tilde{\beta}_{t} = \frac{(1-\overline{\alpha}_{t-1}) \beta_{t}}{1 - \overline{\alpha}_{t}} $$
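
As a reference for these two quantities, here is a short sketch (my own, with a hypothetical linear $\beta$ schedule) that precomputes the coefficients of $\tilde{\boldsymbol{\mu}}_{t}$ and the posterior variance $\tilde{\beta}_{t}$ for every step.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                # assumed schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.append(1.0, alpha_bar[:-1])   # \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1

# Coefficients of the posterior mean  mu~_t = c_t * x_t + c_0 * x_0
coef_xt = np.sqrt(alphas) * (1 - alpha_bar_prev) / (1 - alpha_bar)
coef_x0 = np.sqrt(alpha_bar_prev) * (1 - alphas) / (1 - alpha_bar)

# Posterior variance  beta~_t
beta_tilde = (1 - alpha_bar_prev) * betas / (1 - alpha_bar)

def posterior_mean(x_t, x_0, t):
    """mu~_t(x_t, x_0) for a 0-indexed step t."""
    return coef_xt[t] * x_t + coef_x0[t] * x_0
```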

3 Diffusion Models and Denoising Autoencoders

3.1 Forward Process and $L_{T}$

The paper fixes $\beta_{t}$ as constants rather than treating them as learnable parameters. Hence $L_{T}$ contains no trainable parameters and can be omitted when implementing the loss function.

$$
\begin{align*}
L_{T} &= D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right) \\
&= D_{\text{KL}} \Big[ \mathcal{N} \left( \sqrt{\overline{\alpha}_{T}}\mathbf{x}_{0}, (1-\overline{\alpha}_{T}) \mathbf{I} \right) \,\|\, \mathcal{N} \left( \mathbf{0}, \mathbf{I} \right) \Big] \\
&= \dfrac{1}{2} \left[ -\log (1-\overline{\alpha}_{T})^{D} + D(1-\overline{\alpha}_{T}) + \overline{\alpha}_{T}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0} - D \right]
\end{align*}
$$
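
$L_{T}$ can also be evaluated numerically; the sketch below (assuming the same hypothetical linear schedule as above and image dimensions of my own choosing) shows that it is negligibly small, which is another way to see why it can be dropped.

```python
import numpy as np

T, D = 1000, 3 * 32 * 32                     # D chosen as a CIFAR-10-sized example (assumption)
betas = np.linspace(1e-4, 0.02, T)
alpha_bar_T = np.prod(1.0 - betas)

x0_sq_norm = float(D)                        # pretend every pixel has unit second moment

L_T = 0.5 * (-D * np.log(1 - alpha_bar_T) + D * (1 - alpha_bar_T)
             + alpha_bar_T * x0_sq_norm - D)
print(L_T)                                   # a small number (~0.07): the prior-matching term is ~0
```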

3.2 Reverse Process and $L_{1:T-1}$

For $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t), \boldsymbol{\Sigma}_{\theta} (\mathbf{x}_{t}, t))$, the authors fix the covariance matrix as $\boldsymbol{\Sigma}_{\theta} (\mathbf{x}_{t}, t) = \sigma_{t}^{2} \mathbf{I}$ with no learnable parameters. Experimentally, the following two choices gave comparable results:

$$ \sigma_{t}^{2} = \beta_{t} \qquad \text{or} \qquad \sigma_{t}^{2} = \tilde{\beta}_{t} = \dfrac{1 - \overline{\alpha}_{t-1}}{1 - \overline{\alpha}_{t}} \beta_{t} $$

The first choice is optimal when $\mathbf{x}_{0} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, while the second is optimal when $\mathbf{x}_{0}$ is deterministically fixed at a single point.

The loss term $L_{t-1}$ then takes the following form:

$$
\begin{align*}
L_{t-1} &= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \Big] \\
&= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( \mathcal{N}( \tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}), \tilde{\beta}_{t} \mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t), \sigma_{t}^{2} \mathbf{I}) \right) \Big] \\
&= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \left[ \dfrac{1}{2} \left( \log \left( \dfrac{\sigma_{t}^{2}}{\tilde{\beta}_{t}} \right)^{D} + D\dfrac{\tilde{\beta}_{t}}{\sigma_{t}^{2}} + \dfrac{(\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t))^{2}}{\sigma_{t}^{2}} - D \right) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \left[ \dfrac{1}{2\sigma_{t}^{2}} (\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t))^{2}\right] + C_{2}
\end{align*}
$$

Here $C_{2}$ does not depend on the parameters $\theta$. Therefore $\boldsymbol{\mu}_{\theta}$ should be modeled so as to predict $\tilde{\boldsymbol{\mu}}_{t}$. Let us unfold $\tilde{\boldsymbol{\mu}}_{t}$ explicitly to sharpen the learning target.

From $(5)$, $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$ are related by $\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) = \sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_{t}}\boldsymbol{\epsilon}$, and therefore:

$$
\begin{align*}
\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) &= \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0} \\
&= \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}} \left( \dfrac{1}{\sqrt{\overline{\alpha}_{t}}} \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \dfrac{\sqrt{1-\overline{\alpha}_{t}}}{\sqrt{\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) \\
&= \left( \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}} \right)\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})\sqrt{1-\overline{\alpha}_{t}}}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \\
&= \left( \frac{\alpha_{t}(1-\overline{\alpha}_{t-1})}{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t})} + \frac{1 - \alpha_{t}}{(1-\overline{\alpha}_{t})\sqrt{\alpha_{t}}} \right)\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{1 - \alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}\sqrt{\alpha_{t}}}\boldsymbol{\epsilon} \\
&= \dfrac{1}{\sqrt{\alpha_{t}}}\left( \frac{ (\alpha_{t} - \overline{\alpha}_{t}) + (1 - \alpha_{t})}{1-\overline{\alpha}_{t}} \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) \\
&= \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right)
\end{align*}
$$

Thus, $L_{t-1}$ can be rewritten as:

$$ L_{t-1} = \mathbb{E}_{\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon})} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}), t) \right]^{2} \right] + C_{2} \tag{11} $$

Hence the learning target of $\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t)$ is $\dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \dfrac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right)$. Since $\mathbf{x}_{t}$ is given as the input and $\beta_{t}$ is a fixed constant, the only genuinely unknown quantity is the noise $\boldsymbol{\epsilon} = \boldsymbol{\epsilon}_{t}$, so it is natural to make the network predict $\boldsymbol{\epsilon}$ and let the parameters $\theta$ enter only through that prediction:

$$ \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t) = \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right) \tag{12} $$
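
In code, $(12)$ is just a reparameterization of the network output; a minimal sketch (assuming `eps_theta` is some noise-prediction network, here only a placeholder) looks as follows.

```python
import numpy as np

def mu_theta(x_t, t, eps_theta, alphas, alpha_bar):
    """Mean of p_theta(x_{t-1} | x_t) computed from a noise prediction, as in (12).

    eps_theta: callable (x_t, t) -> predicted noise, a stand-in for the trained network.
    alphas, alpha_bar: per-step arrays of alpha_t and \bar{alpha}_t (0-indexed in t).
    """
    beta_t = 1.0 - alphas[t]
    return (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps_theta(x_t, t)) / np.sqrt(alphas[t])
```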

This parameterization is also intuitively appealing: if the noise $\boldsymbol{\epsilon}_{t}$ that was added is known, it can simply be subtracted off to move back toward $\mathbf{x}_{0}$. The derivation above shows that this intuition is backed by a strictly mathematical argument.

With $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}( \boldsymbol{\mu}_{\theta}, \sigma_{t}^{2}\mathbf{I})$ and this mean vector, sampling $\mathbf{x}_{t-1} \sim p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ given $\mathbf{x}_{t}$ amounts to:

$$ \mathbf{x}_{t-1} = \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right) + \sigma_{t} \mathbf{z},\qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$
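
Iterating this step from $t = T$ down to $t = 1$ gives the sampling procedure (Algorithm 2 in the paper). The sketch below is a schematic version under the assumptions already used above: a hypothetical `eps_theta` network, $\sigma_{t}^{2} = \beta_{t}$, and no noise added at the last step.

```python
import numpy as np

def ddpm_sample(eps_theta, shape, betas, rng):
    """Draw one sample by running the reverse process from pure noise (a sketch)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = len(betas)

    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in reversed(range(T)):                   # t = T-1, ..., 0 (0-indexed)
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z           # sigma_t^2 = beta_t
    return x
```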

Finally, substituting $(12)$ into $(11)$ gives:

$$
\begin{align*}
& \mathbb{E}_{\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon})} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}), t) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{t}} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) - \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{t}} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} - \dfrac{1}{\sqrt{\alpha_{t}}}\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{t}} \left[ \dfrac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t} (1 - \overline{\alpha}_{t})} \left[ \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{0}, \boldsymbol{\epsilon}} \left[ \dfrac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t} (1 - \overline{\alpha}_{t})} \left[ \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_{t}}\boldsymbol{\epsilon}, t) \right]^{2} \right]
\end{align*}
$$

This leads directly to the training and sampling procedures given as pseudocode in the paper.

3.3 Data Scaling, Reverse Process Decoder, and $L_{0}$

The discussion so far has been in terms of continuous probability density functions, but image data are discrete values in $\left\{ 0, 1, \dots, 255 \right\}$, so appropriate scaling is needed. The values $\left\{ 0, 1, \dots, 255 \right\}$ are first linearly rescaled to $[-1, 1]$. The authors then define the final step of sampling, the decoder $p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})$, as:

$$ p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) = \prod\limits_{i = 1}^{D} \int\limits_{\delta_{-}(x_{0}^{i})}^{\delta_{+}(x_{0}^{i})} \mathcal{N}(x; \mu_{\theta}^{i}(\mathbf{x}_{1}, 1), \sigma_{1}^{2}) dx $$

$$ \delta_{+}(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}, \qquad \delta_{-}(x) = \begin{cases} x - \frac{1}{255} & \text{if } x > -1 \\ -\infty & \text{if } x = -1 \end{cases} $$

Here $x_{0}^{i}$ denotes the $i$th pixel of $\mathbf{x}_{0}$; a pixel value $x$ is interpreted as the interval $[x - \frac{1}{255}, x + \frac{1}{255}]$.
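
A sketch of this discretized Gaussian log-likelihood (my own helper, using the standard normal CDF; the clamping of the outermost bins follows the definition of $\delta_{\pm}$ above):

```python
import numpy as np
from math import erf

def std_normal_cdf(z):
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def discretized_gaussian_log_likelihood(x0, mu, sigma):
    """log p_theta(x_0 | x_1) for pixels x0 scaled to [-1, 1] with bin width 2/255."""
    x0, mu = np.asarray(x0, float), np.asarray(mu, float)
    upper = np.where(x0 >= 1.0, np.inf, x0 + 1.0 / 255.0)      # delta_plus
    lower = np.where(x0 <= -1.0, -np.inf, x0 - 1.0 / 255.0)    # delta_minus
    cdf_up = np.vectorize(std_normal_cdf)((upper - mu) / sigma)
    cdf_lo = np.vectorize(std_normal_cdf)((lower - mu) / sigma)
    prob = np.clip(cdf_up - cdf_lo, 1e-12, 1.0)                # avoid log(0)
    return np.sum(np.log(prob))                                # sum over the D pixels
```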

3.4 Simplified Training Objective

In Section 3.2 we derived $L_{t-1}$ so that the network predicts $\boldsymbol{\epsilon}$; in practice, the authors found that dropping the weighting factor in front works better, both for sample quality and for ease of implementation:

$$ L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_{0}, \boldsymbol{\epsilon}} \left[ \left( \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_{t}}\boldsymbol{\epsilon}, t) \right)^{2} \right] $$

As in the pseudocode, $t$ is drawn uniformly from $\left\{ 1, \dots, T \right\}$.
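
Putting the pieces together, one step of the simplified objective (Algorithm 1 in the paper) can be sketched as follows; `eps_theta` again stands for an arbitrary noise-prediction network, and the schedule is the hypothetical linear one used earlier.

```python
import numpy as np

def lsimple_loss(eps_theta, x0_batch, betas, rng):
    """Monte Carlo estimate of L_simple for one batch (a sketch, not the paper's code)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    B = x0_batch.shape[0]

    t = rng.integers(0, len(betas), size=B)                      # t ~ Uniform over steps (0-indexed here)
    eps = rng.standard_normal(x0_batch.shape)                    # eps ~ N(0, I)

    a_bar = alpha_bar[t].reshape(B, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1 - a_bar) * eps   # closed form (5)

    pred = eps_theta(x_t, t)                                     # network's per-example noise prediction
    return np.mean((eps - pred) ** 2)                            # squared error in eps
```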


  1. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851. ↩︎

  2. https://x.com/cyxacxa https://x.com/cyxacxa/status/1903757493987389938/photo/1 ↩︎

  3. Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International conference on machine learning. pmlr, 2015. ↩︎

  4. Feller, William. “Retracted chapter: On the theory of stochastic processes, with particular reference to applications.” Selected Papers I. Springer, Cham, 2015. 769-798. ↩︎