
Paper Review: Denoising Diffusion Probabilistic Models (DDPM)

Overview and Summary

A generative model is a method for finding the probability distribution $Y$ that a given set of samples $\left\{ y_{i} \right\}$ follows. Since finding this distribution directly from scratch is highly challenging, it is common to approximate it by way of a well-known distribution. Thus, if a dataset $\left\{ x_{i} \right\}$ drawn from a well-known distribution $X$ is given, a generative model can be regarded as a function $f$, and developing a generative model amounts to finding a function that closely approximates $f$.

$$ f : \left\{ x_{i} \right\} \to \left\{ y_{j} \right\} $$

The most commonly used well-known distribution is the normal distribution. Therefore, one can loosely think of a generative model as a method for producing samples that follow an unknown distribution, starting from samples drawn from a normal distribution.

Denoising Diffusion Probabilistic Models (DDPM)1 likewise introduce a method for obtaining data that follow an unknown distribution starting from a normal distribution. Consider a noise vector $\mathbf{z}$ drawn from the normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and an image $\mathbf{x}_{0}$ assumed to be drawn from some unknown distribution $X$. Finding a function $\mathbf{x}_{0} = p(\mathbf{z})$ that produces $\mathbf{x}_{0}$ from $\mathbf{z}$ is difficult, but finding the reverse map $\mathbf{z} = q(\mathbf{x}_{0})$ that degrades $\mathbf{x}_{0}$ all the way down to $\mathbf{z}$ is not easy either. (Photo credits2)

However, it is easy to obtain a slightly degraded image $\mathbf{x}_{1} = \mathbf{x}_{0} + \boldsymbol{\epsilon}_{1}$ by adding a tiny Gaussian noise $\boldsymbol{\epsilon}_{1}$ to the image $\mathbf{x}_{0}$. Drawing another small Gaussian noise $\boldsymbol{\epsilon}_{2}$ and adding it to obtain $\mathbf{x}_{2} = \mathbf{x}_{1} + \boldsymbol{\epsilon}_{2}$ is just as easy. After repeating this process many times, $\mathbf{x}_{T}$ looks like noise drawn from a normal distribution. In other words, repeatedly adding Gaussian noise to the image $\mathbf{x}_{0}$ yields an image resembling a sample from a normal distribution; this process is called diffusion. The name comes from the fact that, mathematically, the random spreading of particles known as Brownian motion can be described by repeatedly adding Gaussian increments, and real-world Brownian motion is related to the diffusion (heat) equation.

Since diffusion repeatedly adds noise, its reverse repeatedly subtracts noise, which naturally leads to denoising as the inverse process. This paper introduces a method for carrying out the unknown denoising process by exploiting knowledge of the known diffusion process. The idea was first introduced as diffusion probabilistic models (DPM) in a 2015 paper3, Deep Unsupervised Learning using Nonequilibrium Thermodynamics, and DDPM complements it with an effective integration with deep learning.

The DDPM paper and its appendix are rather terse in their mathematical exposition, and many reviews of the paper fail to explain clearly how the formulas are derived. This article focuses meticulously on the mathematical derivations in DDPM rather than on its performance or results. Although the text is somewhat lengthy, no step in the derivations should pose an obstacle.

Not a single line of formula has been glossed over. This is arguably the most mathematically accurate and detailed explanation of the DDPM paper’s formulas worldwide.

Preliminaries

In this article, a random variable $\mathbf{x} \in \mathbb{R}^{D}$ and its realization are both denoted by $\mathbf{x} \in \mathbb{R}^{D}$. While the probability density function of a random variable $X$ is usually written $p_{X}(x)$, here it will be abbreviated as $p(\mathbf{x}) = p_{\mathbf{x}}(\mathbf{x})$.

A sequence of random variables $(\mathbf{x}_{0}, \mathbf{x}_{1}, \dots, \mathbf{x}_{T}) = \left\{ \mathbf{x}_{t} \right\}_{t=0}^{T}$ is called a stochastic process. In this article, such a process is denoted $\mathbf{x}_{0:T} = (\mathbf{x}_{0}, \mathbf{x}_{1}, \dots, \mathbf{x}_{T})$. The joint probability density function of $\mathbf{x}_{0:T}$ is written as follows:

$$ p(\mathbf{x}_{0:T}) = p_{\mathbf{x}_{0:T}}(\mathbf{x}_{0:T}) $$

Given $\mathbf{x}_{0:T-1}$, the conditional probability density function of $\mathbf{x}_{T}$, denoted $p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1})$, is defined by:

$$ p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}) = \dfrac{p(\mathbf{x}_{0:T})}{p(\mathbf{x}_{0:T-1})} $$

By repeating the above definition, we obtain the following.

$$ p(\mathbf{x}_{0:T}) = p(\mathbf{x}_{0}) p(\mathbf{x}_{1} | \mathbf{x}_{0}) p(\mathbf{x}_{2} | \mathbf{x}_{0:1}) \cdots p(\mathbf{x}_{T-1} | \mathbf{x}_{0:T-2}) p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}) \tag{1} $$

If $\mathbf{x}_{0:T}$ satisfies the following for every $t$, it is called a Markov chain:

$$ p(\mathbf{x}_{t} | \mathbf{x}_{0:t-1}) = p(\mathbf{x}_{t} | \mathbf{x}_{t-1}) $$

If $\mathbf{x}_{0:T}$ is a Markov chain, then $(1)$ simplifies as follows:

$$
\begin{align*}
p(\mathbf{x}_{0:T}) &= p(\mathbf{x}_{0}) p(\mathbf{x}_{1} | \mathbf{x}_{0}) p(\mathbf{x}_{2} | \mathbf{x}_{0:1}) \cdots p(\mathbf{x}_{T-1} | \mathbf{x}_{0:T-2}) p(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}) \\
&= p(\mathbf{x}_{0}) p(\mathbf{x}_{1} | \mathbf{x}_{0}) p(\mathbf{x}_{2} | \mathbf{x}_{1}) \cdots p(\mathbf{x}_{T-1} | \mathbf{x}_{T-2}) p(\mathbf{x}_{T} | \mathbf{x}_{T-1}) \\
&= p(\mathbf{x}_{0}) \prod\limits_{t=1}^{T} p(\mathbf{x}_{t} | \mathbf{x}_{t-1}) \tag{2}
\end{align*}
$$
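
As a quick sanity check of $(2)$, the sketch below (my own toy example, not from the paper) builds a three-state Markov chain with a hypothetical transition matrix and confirms that the joint probability of a path is the product of the initial probability and the one-step transition probabilities.

```python
import numpy as np

# Toy 3-state Markov chain (hypothetical numbers, for illustration only).
p0 = np.array([0.5, 0.3, 0.2])            # distribution of x_0
P = np.array([[0.7, 0.2, 0.1],            # P[i, j] = p(x_t = j | x_{t-1} = i)
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

path = [0, 1, 1, 2]                        # a realization (x_0, x_1, x_2, x_3)

# Joint probability via the factorization p(x_0) * prod_t p(x_t | x_{t-1}).
joint = p0[path[0]]
for prev, cur in zip(path[:-1], path[1:]):
    joint *= P[prev, cur]
print(joint)                               # 0.5 * 0.2 * 0.8 * 0.1 = 0.008
```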

2 Background

Section 2 briefly introduces how the conceptual idea of the diffusion model can be mathematically expressed. Since the foundational idea comes from other papers, the explanation is not in-depth, making it difficult for beginners due to omitted details. I’ll elaborate on these as thoroughly as possible below.

  • Diffusion Process (Forward Process)

Let us first write the diffusion process mathematically. The image $\mathbf{x}_{t} \in \mathbb{R}^{D}$ at step $t$ is the image $\mathbf{x}_{t-1} \in \mathbb{R}^{D}$ from step $t-1$ with Gaussian noise $\sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \sim \mathcal{N}(\mathbf{0}, \beta_{t}\mathbf{I})$ added, where $\beta_{t}$ controls how much noise is injected at step $t$:

$$ \mathbf{x}_{t} = \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t}, \qquad \boldsymbol{\epsilon}_{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$

The reason the coefficients are $(\sqrt{1-\beta_{t}}, \sqrt{\beta_{t}})$ rather than $(1-\beta_{t}, \beta_{t})$ is the property of variance $\Var(aX) = a^{2}\Var(X)$: the square roots are needed so that $\mathbf{x}_{t}$ approximately follows a standard normal distribution once $t$ is sufficiently large. Indeed, suppose $t$ is large enough that $\mathbf{x}_{t-1}$ already follows a standard normal distribution. Then the covariance matrix of $\mathbf{x}_{t}$ is preserved as shown below, so $\mathbf{x}_{t}$ still follows a standard normal distribution. Since $\mathbf{x}_{t-1}$ and $\boldsymbol{\epsilon}_{t}$ are independent, the following holds:

$$
\begin{align*}
\Cov(\mathbf{x}_{t}) &= \Cov(\sqrt{1-\beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t}) \\
&= (1-\beta_{t})\Cov(\mathbf{x}_{t-1}) + \beta_{t}\Cov(\boldsymbol{\epsilon}_{t}) \\
&= (1-\beta_{t})\mathbf{I} + \beta_{t}\mathbf{I} \\
&= \mathbf{I}
\end{align*}
$$
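
The following sketch (my own illustration, not from the paper) checks this variance-preserving property numerically: starting from unit-variance samples, one noising step with coefficients $(\sqrt{1-\beta}, \sqrt{\beta})$ keeps the sample covariance close to the identity, while $(1-\beta, \beta)$ would shrink it.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02                                   # a typical per-step noise level
x_prev = rng.standard_normal((100_000, 2))    # pretend x_{t-1} ~ N(0, I)
eps = rng.standard_normal((100_000, 2))

x_t = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * eps
print(np.cov(x_t, rowvar=False))              # ~ identity matrix

# With coefficients (1 - beta, beta) instead, the variance would shrink:
x_bad = (1 - beta) * x_prev + beta * eps
print(np.cov(x_bad, rowvar=False))            # ~ (1-beta)^2 + beta^2 < 1 on the diagonal
```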

When $\mathbf{x}_{t-1}$ is fixed (treated as a constant), the conditional probability density function of $\mathbf{x}_{t}$ is as follows:

$$ q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_{t}}\mathbf{x}_{t-1}, \beta_{t}\mathbf{I}) \tag{3} $$

(The paper uses the notation $\mathcal{N}(\mathbf{x}_{t}; \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1}, \beta_{t}\mathbf{I})$ to make explicit that this is a distribution over $\mathbf{x}_{t}$.) The computations below verify that the mean vector and covariance matrix are as stated; assume $\mathbf{x}_{t-1}$ is fixed (i.e., constant).

$$
\begin{align*}
\mathbb{E} [\mathbf{x}_{t} | \mathbf{x}_{t-1}] &= \mathbb{E} \left[ \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right] \qquad (\mathbf{x}_{t-1} \text{ is a constant vector}) \\
&= \mathbb{E} \left[ \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} \right] + \mathbb{E} \left[ \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right] \\
&= \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\mathbb{E} \left[ \boldsymbol{\epsilon}_{t} \right] \\
&= \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1}
\end{align*}
$$

$$
\begin{align*}
\Cov (\mathbf{x}_{t} | \mathbf{x}_{t-1}) &= \Cov \left( \sqrt{1 - \beta_{t}}\mathbf{x}_{t-1} + \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right) \qquad (\mathbf{x}_{t-1} \text{ is a constant vector}) \\
&= \Cov \left( \sqrt{\beta_{t}}\boldsymbol{\epsilon}_{t} \right) \\
&= \beta_{t}\Cov \left( \boldsymbol{\epsilon}_{t} \right) \\
&= \beta_{t} \mathbf{I}
\end{align*}
$$

Conditioned on a given image $\mathbf{x}_{0}$, repeatedly adding Gaussian noise produces the degraded images $\mathbf{x}_{1}, \mathbf{x}_{2}, \dots, \mathbf{x}_{T}$ (note that the same notation is used for these random variables). The density of this entire trajectory is the conditional joint probability density function $q(\mathbf{x}_{1:T} | \mathbf{x}_{0})$, referred to as the forward process or diffusion process. Since $\mathbf{x}_{0:T}$ is a Markov chain, by $(2)$ the diffusion process can be modeled as follows:

$$ q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) = \prod\limits_{t=1}^{T} q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) \tag{4} $$

Meanwhile, the conditional probability density function of $\mathbf{x}_{t}$ at an arbitrary step $t$ given $\mathbf{x}_{0}$ can be written in closed form. Let $\alpha_{t} = 1 - \beta_{t}$; then:

$$
\begin{align*}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t} \\
&= \sqrt{\alpha_{t}}\left( \sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-1} \right) + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t} \\
&= \sqrt{\alpha_{t}}\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{\alpha_{t}}\sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-1} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t}
\end{align*}
$$

The sum of two independent normal random variables is again normal:

If $X_{1} \sim \mathcal{N}(\mu_{1}, \sigma_{1}^{2})$ and $X_{2} \sim \mathcal{N}(\mu_{2}, \sigma_{2}^{2})$ are independent, then $X_{1} + X_{2} \sim \mathcal{N}(\mu_{1} + \mu_{2}, \sigma_{1}^{2} + \sigma_{2}^{2})$. The same holds for random vectors.

Therefore, the following holds:

$$
\begin{align*}
\sqrt{\alpha_{t}}\sqrt{1 - \alpha_{t-1}}\boldsymbol{\epsilon}_{t-1} + \sqrt{1 - \alpha_{t}}\boldsymbol{\epsilon}_{t} &\sim \mathcal{N}\big(\mathbf{0},\ \alpha_{t}(1-\alpha_{t-1})\mathbf{I} + (1 - \alpha_{t})\mathbf{I}\big) \\
&= \mathcal{N}\big(\mathbf{0},\ (1 - \alpha_{t}\alpha_{t-1})\mathbf{I}\big)
\end{align*}
$$

Thus, $\mathbf{x}_{t}$ is given by:

$$ \mathbf{x}_{t} = \sqrt{\alpha_{t}\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$

Repeating this process yields the following:

$$
\begin{align*}
\mathbf{x}_{t} &= \sqrt{\alpha_{t}\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}}\left(\sqrt{\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{1 - \alpha_{t-2}}\boldsymbol{\epsilon}_{t-2}\right) + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{\alpha_{t}\alpha_{t-1}(1 - \alpha_{t-2})}\boldsymbol{\epsilon}_{t-2} + \sqrt{1 - \alpha_{t}\alpha_{t-1}}\boldsymbol{\epsilon} \\
&= \sqrt{\alpha_{t}\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3} + \sqrt{1 - \alpha_{t}\alpha_{t-1}\alpha_{t-2}}\boldsymbol{\epsilon} \\
&\ \ \vdots \\
&= \sqrt{\prod\limits_{s=1}^{t} \alpha_{s}}\,\mathbf{x}_{0} + \sqrt{1 - \prod\limits_{s=1}^{t} \alpha_{s}}\,\boldsymbol{\epsilon} \\
&= \sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1 - \overline{\alpha}_{t}}\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad \overline{\alpha}_{t} = \prod\limits_{s=1}^{t} \alpha_{s} \tag{5}
\end{align*}
$$
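
The sketch below (my own check, using a hypothetical constant $\beta_{t}$ schedule) compares noising a sample step by step against the closed-form expression $(5)$; both produce the same mean and variance statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.full(T, 0.02)                       # hypothetical constant schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                 # \bar{alpha}_t

x0 = rng.standard_normal(100_000) * 0.1 + 3.0  # some non-standard "data"

# Iterative forward process: x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) eps_t
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(1 - alphas[t]) * rng.standard_normal(x0.shape)

# Closed form (5): x_T = sqrt(alpha_bar_T) x_0 + sqrt(1 - alpha_bar_T) eps
x_closed = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(x0.shape)

print(x.mean(), x.var())                       # both close to (sqrt(alpha_bar_T) * 3, ~1)
print(x_closed.mean(), x_closed.var())
```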

Hence, given $\mathbf{x}_{0}$, the conditional density of $\mathbf{x}_{t}$ is given as follows:

$$ q(\mathbf{x}_{t} | \mathbf{x}_{0}) = \mathcal{N}(\sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0}, (1 - \overline{\alpha}_{t})\mathbf{I}) \tag{6} $$

Since each $\alpha_{t}$ is strictly smaller than $1$ (i.e., $\beta_{t} > 0$), for sufficiently large $T$ the product $\overline{\alpha}_{T} = \prod_{s=1}^{T} \alpha_{s}$ is close to $0$, and thus the following holds:

$$ q(\mathbf{x}_{T} | \mathbf{x}_{0}) \to \mathcal{N}(\mathbf{0}, \mathbf{I}) \quad \text{as } T \to \infty $$

These equations support the idea that $\mathbf{x}_{T}$, obtained by repeatedly adding Gaussian noise to $\mathbf{x}_{0}$, follows a standard normal distribution.
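
For a concrete feel, the sketch below evaluates $\overline{\alpha}_{T}$ for a linear schedule (the commonly used setting $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$ with $T=1000$; treat these numbers as an assumption here) and shows that it is vanishingly small, so $q(\mathbf{x}_{T}|\mathbf{x}_{0})$ is essentially $\mathcal{N}(\mathbf{0},\mathbf{I})$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (assumed here)
alphas = 1.0 - betas
alpha_bar_T = np.prod(alphas)

print(alpha_bar_T)                   # ~ 4e-5: q(x_T | x_0) = N(sqrt(4e-5) x_0, (1 - 4e-5) I)
print(np.sqrt(alpha_bar_T))          # the residual weight on x_0 is only ~ 0.006
```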

  • Denoising Process (Reverse Process)

Now consider the reverse process (denoising), which recovers $\mathbf{x}_{0}$ by gradually removing noise starting from Gaussian noise $\mathbf{x}_{T} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. According to Feller4, for distributions such as the normal or binomial, the reversal of the diffusion process has the same functional form as the forward process, provided the diffusion rate $\beta$ is sufficiently small.

“For both Gaussian and binomial diffusion, for continuous diffusion (limit of small step size $\beta$) the reversal of the diffusion process has the identical functional form as the forward process.”

– Sohl-Dickstein, Jascha, et al. Deep unsupervised learning using nonequilibrium thermodynamics. International conference on machine learning. PMLR, 2015.

This implies that the probability density function of the denoising process, denoted $p$, satisfies the following: since $q(\mathbf{x}_{t} | \mathbf{x}_{t-1})$ is a normal distribution, its reversal is as well,

$$ p(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}_{t}, t), \boldsymbol{\Sigma}(\mathbf{x}_{t}, t)) $$

The reverse process is also a Markov chain, so the entire denoising trajectory, starting from standard normal noise $\mathbf{x}_{T}$ and ending at a specific image $\mathbf{x}_{0}$, can be written as follows:

$$ p(\mathbf{x}_{0:T}) = p(\mathbf{x}_{T}) \prod\limits_{t=1}^{T} p(\mathbf{x}_{t-1} | \mathbf{x}_{t}) $$

Unlike the diffusion process $q$, the denoising process $p(\mathbf{x}_{0:T})$ is unknown and must be estimated. Denote by $p_{\theta}$ the approximation of $p$ with parameters $\theta$. Since the density of $\mathbf{x}_{T}$ is known to be $p(\mathbf{x}_{T}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, we have:

$$ p_{\theta} (\mathbf{x}_{0:T}) = p(\mathbf{x}_{T}) \prod\limits_{t=1}^{T} p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_{t}) \tag{7} $$

While each $p(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ is known to be normal, its mean vector and covariance matrix are unknown and must also be estimated. Consequently, $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ is parameterized as follows:

$$ p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t), \boldsymbol{\Sigma}_{\theta}(\mathbf{x}_{t}, t)) $$

  • Loss Function

To generate new data, we want the probability distribution $p(\mathbf{x}_{0})$ of the data $\mathbf{x}_{0}$: knowing it would allow new samples to be drawn from the same distribution as the existing data. Let $p_{\theta}(\mathbf{x}_{0})$, with parameters $\theta$, be the approximation of $p(\mathbf{x}_{0})$. To make $p_{\theta}(\mathbf{x}_{0})$ close to $p(\mathbf{x}_{0})$, we treat $p_{\theta}(\mathbf{x}_{0})$ as a likelihood and use maximum likelihood estimation. Since the distributions involved are assumed normal, working with the log-likelihood simplifies the mathematics and connects naturally with relative entropy (Kullback-Leibler divergence). Maximizing $\log p_{\theta}(\mathbf{x}_{0})$ is equivalent to minimizing $-\log p_{\theta}(\mathbf{x}_{0})$, which is our starting point (gradient descent will be used). Hence the goal is to find $\theta$ that minimizes $\mathbb{E}_{q(\mathbf{x}_{0})}[-\log p_{\theta}(\mathbf{x}_{0})]$. By the definitions of marginal and conditional probability density functions, $p_{\theta}(\mathbf{x}_{0})$ can be expressed as follows:

$$
\begin{align*}
p_{\theta}(\mathbf{x}_{0}) &= \int p_{\theta}(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \\
&= \int p_{\theta}(\mathbf{x}_{0:T}) \dfrac{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} d\mathbf{x}_{1:T} \\
&= \int q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} d\mathbf{x}_{1:T} \\
&= \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \left[ \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right]
\end{align*}
$$

We can break down the expectation we intend to minimize as follows. By Jensen's inequality, $-\log (\mathbb{E}[X]) \le \mathbb{E}[-\log X]$, which implies:

$$
\begin{align*}
\mathbb{E}_{q(\mathbf{x}_{0})}[-\log p_{\theta}(\mathbf{x}_{0})] &= \mathbb{E}_{q(\mathbf{x}_{0})} \left[ -\log \left( \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \left[ \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right] \right) \right] \\
&\le \mathbb{E}_{q(\mathbf{x}_{0})} \left[ \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})}\left( -\log \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right]
\end{align*}
$$

The final equality holds by the definition of conditional probability density functions, $q(X,Y) = q(X)q(Y|X)$; writing the two expectations as integrals makes this clear. Substituting $(4)$ and $(7)$ into the equation yields:

$$
\begin{align*}
\mathbb{E}_{q(\mathbf{x}_{0})}[-\log p_{\theta}(\mathbf{x}_{0})] &\le \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log \dfrac{ p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log \dfrac{ p(\mathbf{x}_{T}) \prod\limits_{t=1}^{T} p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_{t})}{\prod\limits_{t=1}^{T} q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ -\log p(\mathbf{x}_{T}) - \sum\limits_{t=1}^{T} \log \dfrac{ p_{\theta} (\mathbf{x}_{t-1} | \mathbf{x}_{t})}{ q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} \right] =: L
\end{align*}
$$

The last equality follows from the properties of logarithms. The left-hand side is our actual objective, but it cannot be computed because the true data distribution $q(\mathbf{x}_{0})$ is unknown. The right-hand side, however, consists of computable quantities: $p(\mathbf{x}_{T})$ is a standard normal density by assumption, and $q(\mathbf{x}_{t} | \mathbf{x}_{t-1})$ is obtained by actually running the diffusion process. Therefore, by minimizing the computable right-hand side, the inequality guarantees that the left-hand side is indirectly minimized as well.

Through algebraic manipulation, the loss can be rewritten in a form better suited for computation. As it stands, the term $\log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t} | \mathbf{x}_{t-1})}$ in $L$ must be estimated by sampling from $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})$, and the resulting Monte Carlo estimates have high variance, which destabilizes training. We therefore rewrite the loss in terms of Kullback-Leibler divergences (KLD). Since the distributions being compared are normal, each KLD has the closed form given below, so the loss can be computed exactly without this sampling variance, yielding more stable and efficient training.

Relative Entropy of Normal Distributions:

The relative entropy between two multivariate normal distributions $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ and $\mathcal{N}(\boldsymbol{\mu}_{1}, \Sigma_{1})$, with $\boldsymbol{\mu}, \boldsymbol{\mu}_{1} \in \mathbb{R}^{D}$, is given by:

$$ D_{\text{KL}}\big( \mathcal{N}(\boldsymbol{\mu}, \Sigma) \,\|\, \mathcal{N}(\boldsymbol{\mu}_{1}, \Sigma_{1}) \big) = \dfrac{1}{2} \left[ \log \left( \dfrac{|\Sigma_{1}|}{|\Sigma|} \right) + \Tr(\Sigma_{1}^{-1}\Sigma) + (\boldsymbol{\mu} - \boldsymbol{\mu}_{1})^{\mathsf{T}} \Sigma_{1}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_{1}) - D \right] $$
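
The sketch below (my own helper, not from the paper) implements this closed form for the isotropic case $\Sigma = s\mathbf{I}$, $\Sigma_{1} = s_{1}\mathbf{I}$, which is the only case the DDPM derivation needs.

```python
import numpy as np

def kl_isotropic_gaussians(mu, s, mu1, s1):
    """KL( N(mu, s*I) || N(mu1, s1*I) ) for isotropic covariances s*I and s1*I."""
    d = mu.shape[-1]
    log_det_ratio = d * (np.log(s1) - np.log(s))        # log(|Sigma1| / |Sigma|)
    trace_term = d * s / s1                              # Tr(Sigma1^{-1} Sigma)
    quad_term = np.sum((mu - mu1) ** 2, axis=-1) / s1    # (mu-mu1)^T Sigma1^{-1} (mu-mu1)
    return 0.5 * (log_det_ratio + trace_term + quad_term - d)

# Sanity check: the KL divergence of a distribution with itself is zero.
mu = np.array([0.3, -1.2, 0.7])
print(kl_isotropic_gaussians(mu, 0.5, mu, 0.5))   # 0.0
```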

$$
\begin{align*}
L &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=1}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t} | \mathbf{x}_{t-1})} - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right]
\end{align*}
$$

For $t \ge 2$, the denominator of the summand can be rewritten as follows:

$$
\begin{align*}
q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) &= q(\mathbf{x}_{t} | \mathbf{x}_{t-1}, \mathbf{x}_{0}) &&\text{by Markov property} \\
&= \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{t-1}, \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0})} &&\text{by def. of conditional pdf} \\
&= \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t}, \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0})} &&\text{by def. of conditional pdf} \\
&= q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0})} \\
&= q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{0}) / q(\mathbf{x}_{0})}{q(\mathbf{x}_{t-1}, \mathbf{x}_{0}) / q(\mathbf{x}_{0})} \\
&= q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \dfrac{q(\mathbf{x}_{t}| \mathbf{x}_{0})}{q(\mathbf{x}_{t-1}| \mathbf{x}_{0})} &&\text{by def. of conditional pdf}
\end{align*}
$$

Substituting gives the formulation:

$$
\begin{align*}
L &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \left( \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}\dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \right) - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})} - \sum\limits_{t=2}^{T} \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})} - \log \left( \dfrac{q(\mathbf{x}_{T-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{T} | \mathbf{x}_{0})} \cdot \dfrac{q(\mathbf{x}_{T-2} | \mathbf{x}_{0})}{q(\mathbf{x}_{T-1} | \mathbf{x}_{0})} \cdots \dfrac{q(\mathbf{x}_{1} | \mathbf{x}_{0})}{q(\mathbf{x}_{2} | \mathbf{x}_{0})} \cdot \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{1} | \mathbf{x}_{0})} \right) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ - \log p(\mathbf{x}_{T}) - \sum\limits_{t=2}^{T} \log \dfrac{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})}{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})} - \log \dfrac{p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}{q(\mathbf{x}_{T} | \mathbf{x}_{0})} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} + \sum\limits_{t=2}^{T} \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} - \log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} \right] + \sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} \right] - \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) \right] \tag{8}
\end{align*}
$$

In the fourth equality the telescoping product cancels term by term, leaving only $p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})$ over $q(\mathbf{x}_{T} | \mathbf{x}_{0})$ inside the logarithm.

The first term of $(8)$ can be rewritten as a KLD, as shown below.

$$
\begin{align*}
\mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} \right] &= \int q(\mathbf{x}_{0:T}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{0:T} \\
&= \int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{0:T} &&\text{by def. of conditional pdf} \\
&= \int\int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{1:T-1} d\mathbf{x}_{T} d\mathbf{x}_{0} \\
&= \int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{T} d\mathbf{x}_{0} &&\text{by def. of marginal pdf} \\
&= \mathbb{E}_{q(\mathbf{x}_{0})}\left[ \int q(\mathbf{x}_{T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{T} | \mathbf{x}_{0})}{p(\mathbf{x}_{T})} d\mathbf{x}_{T} \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right) \Big]
\end{align*}
$$

Note that the paper writes this loosely as $\mathbb{E}_{q(\mathbf{x}_{0:T})} = \mathbb{E}_{q} = \mathbb{E}_{q(\mathbf{x}_{0})}$. Because the quantity inside the expectation depends only on $\mathbf{x}_{0}$, the result is the same even when the expectation is nominally taken over extra variables; since the paper never explains this, the reader has to watch out for it.

Now consider the second term in $(8)$. As with the first term, we rearrange $q(\mathbf{x}_{0:T})$ so that a KLD involving $q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})$ appears.

$$
\begin{align*}
\sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} \right] &= \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{0:T}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{0:T} \\
&= \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:T} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{0:T} \\
&= \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:t-1}, \mathbf{x}_{t+1:T} | \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{0:T} \\
&= \sum\limits_{t=2}^{T} \int\int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{1:t-1}, \mathbf{x}_{t+1:T} | \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{(1:t-2,\, t+1:T)} d\mathbf{x}_{t-1:t} d\mathbf{x}_{0} \\
&= \sum\limits_{t=2}^{T} \int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{t-1}| \mathbf{x}_{t}, \mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) \log \dfrac{q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})}{p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})} d\mathbf{x}_{t-1:t} d\mathbf{x}_{0} \\
&= \sum\limits_{t=2}^{T} \int\int q(\mathbf{x}_{0}) q(\mathbf{x}_{t} | \mathbf{x}_{0}) D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) d\mathbf{x}_{t} d\mathbf{x}_{0} \\
&= \int q(\mathbf{x}_{0}) \left[ \sum\limits_{t=2}^{T} \int q(\mathbf{x}_{t} | \mathbf{x}_{0}) D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) d\mathbf{x}_{t} \right] d\mathbf{x}_{0} \\
&= \mathbb{E}_{q(\mathbf{x}_{0})} \left[ \sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \Big] \right]
\end{align*}
$$

Once again, beware of the notational shortcut in the paper, which writes this as $\mathbb{E}_{q} \left[ \sum_{t=2}^{T} D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \| p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \right]$: the extra integrals over variables that do not appear inside the expectation change nothing, which is exactly what allows the paper's compact notation.

Finally, the whole of $(8)$ can be rewritten in terms of KLDs:

$$
\begin{align*}
L &= \mathbb{E}_{q(\mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right) \Big] \\
&\qquad + \mathbb{E}_{q(\mathbf{x}_{0})} \left[ \sum\limits_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \Big] \right] - \mathbb{E}_{q(\mathbf{x}_{0:T})} \left[ \log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) \right]
\end{align*}
$$

With the same kind of shortcut on the expectations, the paper writes this compactly as:

$$ \mathbb{E}_{q(\mathbf{x}_{0:T})} \bigg[ \underbrace{D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right)}_{L_{T}} + \sum\limits_{t=2}^{T} \underbrace{D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right)}_{L_{t-1}} - \underbrace{\log p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})}_{L_{0}} \bigg] $$

The conditional density $q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})$ appearing above can be computed explicitly as follows.

Properties of Conditional Probability

$$ p(\mathbf{x} | \mathbf{y}, \mathbf{z}) = \dfrac{p(\mathbf{x}, \mathbf{y} | \mathbf{z}) }{p(\mathbf{y} | \mathbf{z})} $$

$$
\begin{align*}
q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) &= \dfrac{q(\mathbf{x}_{t}, \mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \\
&= \dfrac{q(\mathbf{x}_{t} | \mathbf{x}_{t-1}, \mathbf{x}_{0}) q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \\
&= \dfrac{q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})}
\end{align*}
$$

The final equality holds by the Markov property of $\left\{ \mathbf{x}_{t} \right\}$. Substituting the probability density functions $(3)$ and $(6)$ computed earlier gives:

$$
\begin{align*}
q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) &= \dfrac{q(\mathbf{x}_{t} | \mathbf{x}_{t-1}) q(\mathbf{x}_{t-1} | \mathbf{x}_{0})}{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \\
&= C \dfrac{\exp\left( - \dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t} - \sqrt{\alpha_{t}}\mathbf{x}_{t-1} \right)^{2}}{1-\alpha_{t}} \right) \exp\left( - \dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t-1} - \sqrt{\overline{\alpha}_{t-1}}\mathbf{x}_{0} \right)^{2}}{1-\overline{\alpha}_{t-1}} \right)}{\exp\left( - \dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t} - \sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} \right)^{2}}{1-\overline{\alpha}_{t}} \right)} \\
&= C \exp\left[ -\dfrac{1}{2} \left( \dfrac{1}{1-\alpha_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} - \dfrac{2\sqrt{\alpha_{t}}}{1-\alpha_{t}} \mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t-1} + \dfrac{\alpha_{t}}{1-\alpha_{t}} \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} \right.\right.\\
&\qquad\qquad\qquad\quad + \dfrac{1}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - \dfrac{2\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{t-1} + \dfrac{\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0} \\
&\qquad\qquad\qquad\quad \left.\left. - \dfrac{1}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + \dfrac{2\sqrt{\overline{\alpha}_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} - \dfrac{\overline{\alpha}_{t}}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0} \right)\right] \tag{9}
\end{align*}
$$

Here, $C = \dfrac{1}{\sqrt{ \left(2\pi \dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} \right)^{D} }}$ is a constant with respect to $\mathbf{x}_{t-1}$. Since we are after $q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})$, collect the terms of the exponent involving $\mathbf{x}_{t-1}$:

$$
\begin{align*}
&\left( \dfrac{\alpha_{t}}{1-\alpha_{t}} + \dfrac{1}{1-\overline{\alpha}_{t-1}} \right) \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2\left( \dfrac{\sqrt{\alpha_{t}}}{1-\alpha_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \\
&\qquad + \left[ \left( \dfrac{1}{1-\alpha_{t}} - \dfrac{1}{1-\overline{\alpha}_{t}} \right)\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{\sqrt{\overline{\alpha}_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \left( \dfrac{\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t-1}} - \dfrac{\overline{\alpha}_{t}}{1-\overline{\alpha}_{t}} \right)\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right]
\end{align*}
$$

Combine constants from each term through common denominators:

$$ \dfrac{\alpha_{t}}{1-\alpha_{t}} + \dfrac{1}{1-\overline{\alpha}_{t-1}} = \dfrac{\alpha_{t} - \alpha_{t}\overline{\alpha}_{t-1} + 1 - \alpha_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} = \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} $$

$$ \dfrac{1}{1-\alpha_{t}} - \dfrac{1}{1-\overline{\alpha}_{t}} = \dfrac{\alpha_{t} - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t})} = \dfrac{\alpha_{t}(1-\overline{\alpha}_{t-1})}{(1-\alpha_{t})(1-\overline{\alpha}_{t})} $$

$$ \dfrac{\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t-1}} - \dfrac{\overline{\alpha}_{t}}{1-\overline{\alpha}_{t}} = \dfrac{\overline{\alpha}_{t-1} - \overline{\alpha}_{t}}{(1-\overline{\alpha}_{t-1})(1-\overline{\alpha}_{t})} = \dfrac{\overline{\alpha}_{t-1}(1 - \alpha_{t})}{(1-\overline{\alpha}_{t-1})(1-\overline{\alpha}_{t})} $$

Substituting these back in gives:

$$
\begin{align*}
& \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2\left( \dfrac{\sqrt{\alpha_{t}}}{1-\alpha_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_{t-1}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \\
&\qquad + \left[ \dfrac{\alpha_{t}(1-\overline{\alpha}_{t-1})}{(1-\alpha_{t})(1-\overline{\alpha}_{t})}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{\sqrt{\overline{\alpha}_{t}}}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \dfrac{\overline{\alpha}_{t-1}(1 - \alpha_{t})}{(1-\overline{\alpha}_{t-1})(1-\overline{\alpha}_{t})}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right]
\end{align*}
$$

Factoring out the coefficient of $\mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1}$ gives:

$$
\begin{align*}
& \frac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \left( \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2 \left( \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})}{1 - \overline{\alpha}_{t}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \right. \\
&\qquad +\left. \left[ \frac{\alpha_{t}(1-\overline{\alpha}_{t-1})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})\sqrt{\overline{\alpha}_{t}}}{(1 - \overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \frac{\overline{\alpha}_{t-1}(1 - \alpha_{t})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right] \right) \tag{10}
\end{align*}
$$

The bracketed term is in fact a perfect square:

$$
\begin{align*}
&\left[ \dfrac{\alpha_{t}(1-\overline{\alpha}_{t-1})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})\sqrt{\alpha_{t}\overline{\alpha}_{t-1}}}{(1 - \overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \dfrac{\overline{\alpha}_{t-1}(1 - \alpha_{t})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right] \\
&=\left[ \dfrac{\sqrt{\alpha_{t}}^{2}(1-\overline{\alpha}_{t-1})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{t} + 2\dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})\sqrt{\alpha_{t}\overline{\alpha}_{t-1}}}{(1 - \overline{\alpha}_{t})^{2}}\mathbf{x}_{t}^{\mathsf{T}}\mathbf{x}_{0} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}^{2}(1 - \alpha_{t})^{2}}{(1-\overline{\alpha}_{t})^{2}}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0}\right] \\
&=\left[ \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right]^{2}
\end{align*}
$$

Plugging this back into $(10)$ gives:

$$
\begin{align*}
& \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \left( \mathbf{x}_{t-1}^{\mathsf{T}}\mathbf{x}_{t-1} - 2 \left( \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}} \mathbf{x}_{t}^{\mathsf{T}} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1-\alpha_{t})}{1 - \overline{\alpha}_{t}}\mathbf{x}_{0}^{\mathsf{T}} \right)\mathbf{x}_{t-1} \right. \\
&\qquad \left. +\left[ \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right]^{2} \right) \\
&= \dfrac{1 - \overline{\alpha}_{t}}{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})} \left( \mathbf{x}_{t-1} - \left[ \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right] \right)^{2}
\end{align*}
$$

Putting this back into $(9)$ gives:

$$ q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) = \frac{1}{\sqrt{ \left(2\pi \frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} \right)^{D} }} \exp\left[ -\dfrac{1}{2} \dfrac{\left( \mathbf{x}_{t-1} - \left[ \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}\right] \right)^{2}}{\frac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}}} \right] $$

This is exactly the density of a normal distribution:

$$
\begin{align*}
q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) &= \mathcal{N} \left( \dfrac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \dfrac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0},\ \dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-1})}{1 - \overline{\alpha}_{t}}\mathbf{I} \right) \\
&= \mathcal{N} ( \tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}), \tilde{\beta}_{t} \mathbf{I})
\end{align*}
$$

$$ \text{where}\qquad \tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) = \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0}, $$

$$ \tilde{\beta}_{t} = \frac{(1-\overline{\alpha}_{t-1}) \beta_{t}}{1 - \overline{\alpha}_{t}} $$
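
As a reference for these two quantities, here is a short sketch (my own, with a hypothetical linear $\beta$ schedule) that precomputes the coefficients of $\tilde{\boldsymbol{\mu}}_{t}$ and the posterior variance $\tilde{\beta}_{t}$ for every step.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                # assumed schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.append(1.0, alpha_bar[:-1])   # \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1

# Coefficients of the posterior mean  mu~_t = c_t * x_t + c_0 * x_0
coef_xt = np.sqrt(alphas) * (1 - alpha_bar_prev) / (1 - alpha_bar)
coef_x0 = np.sqrt(alpha_bar_prev) * (1 - alphas) / (1 - alpha_bar)

# Posterior variance  beta~_t
beta_tilde = (1 - alpha_bar_prev) * betas / (1 - alpha_bar)

def posterior_mean(x_t, x_0, t):
    """mu~_t(x_t, x_0) for a 0-indexed step t."""
    return coef_xt[t] * x_t + coef_x0[t] * x_0
```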

3 Diffusion Models and Denoising Autoencoders

3.1 Forward Process and $L_{T}$

The paper fixes $\beta_{t}$ as constants rather than treating them as learnable parameters. Hence $L_{T}$ contains no trainable parameters and can be omitted when implementing the loss function.

$$
\begin{align*}
L_{T} &= D_{\text{KL}} \left( q(\mathbf{x}_{T} | \mathbf{x}_{0}) \,\|\, p(\mathbf{x}_{T}) \right) \\
&= D_{\text{KL}} \Big[ \mathcal{N} \left( \sqrt{\overline{\alpha}_{T}}\mathbf{x}_{0}, (1-\overline{\alpha}_{T}) \mathbf{I} \right) \,\|\, \mathcal{N} \left( \mathbf{0}, \mathbf{I} \right) \Big] \\
&= \dfrac{1}{2} \left[ -\log (1-\overline{\alpha}_{T})^{D} + D(1-\overline{\alpha}_{T}) + \overline{\alpha}_{T}\mathbf{x}_{0}^{\mathsf{T}}\mathbf{x}_{0} - D \right]
\end{align*}
$$
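
$L_{T}$ can also be evaluated numerically; the sketch below (assuming the same hypothetical linear schedule as above and image dimensions of my own choosing) shows that it is negligibly small, which is another way to see why it can be dropped.

```python
import numpy as np

T, D = 1000, 3 * 32 * 32                     # D chosen as a CIFAR-10-sized example (assumption)
betas = np.linspace(1e-4, 0.02, T)
alpha_bar_T = np.prod(1.0 - betas)

x0_sq_norm = float(D)                        # pretend every pixel has unit second moment

L_T = 0.5 * (-D * np.log(1 - alpha_bar_T) + D * (1 - alpha_bar_T)
             + alpha_bar_T * x0_sq_norm - D)
print(L_T)                                   # a small number (~0.07): the prior-matching term is ~0
```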

3.2 Reverse Process and $L_{1:T-1}$

For $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t), \boldsymbol{\Sigma}_{\theta} (\mathbf{x}_{t}, t))$, the authors fix the covariance matrix as $\boldsymbol{\Sigma}_{\theta} (\mathbf{x}_{t}, t) = \sigma_{t}^{2} \mathbf{I}$ with no learnable parameters. Experimentally, the following two choices gave comparable results:

$$ \sigma_{t}^{2} = \beta_{t} \qquad \text{or} \qquad \sigma_{t}^{2} = \tilde{\beta}_{t} = \dfrac{1 - \overline{\alpha}_{t-1}}{1 - \overline{\alpha}_{t}} \beta_{t} $$

The first choice is optimal when $\mathbf{x}_{0} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, while the second is optimal when $\mathbf{x}_{0}$ is deterministically fixed at a single point.

The loss term $L_{t-1}$ then takes the following form:

$$
\begin{align*}
L_{t-1} &= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}) \,\|\, p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) \right) \Big] \\
&= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \Big[ D_{\text{KL}} \left( \mathcal{N}( \tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}), \tilde{\beta}_{t} \mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t), \sigma_{t}^{2} \mathbf{I}) \right) \Big] \\
&= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \left[ \dfrac{1}{2} \left( \log \left( \dfrac{\sigma_{t}^{2}}{\tilde{\beta}_{t}} \right)^{D} + D\dfrac{\tilde{\beta}_{t}}{\sigma_{t}^{2}} + \dfrac{(\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t))^{2}}{\sigma_{t}^{2}} - D \right) \right] \\
&= \mathbb{E}_{q(\mathbf{x}_{t} | \mathbf{x}_{0})} \left[ \dfrac{1}{2\sigma_{t}^{2}} (\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t))^{2}\right] + C_{2}
\end{align*}
$$

Here $C_{2}$ does not depend on the parameters $\theta$. Therefore $\boldsymbol{\mu}_{\theta}$ should be modeled so as to predict $\tilde{\boldsymbol{\mu}}_{t}$. Let us unfold $\tilde{\boldsymbol{\mu}}_{t}$ explicitly to sharpen the learning target.

From $(5)$, $\mathbf{x}_{t}$ and $\mathbf{x}_{0}$ are related by $\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) = \sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_{t}}\boldsymbol{\epsilon}$, and therefore:

$$
\begin{align*}
\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t}, \mathbf{x}_{0}) &= \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}}\mathbf{x}_{0} \\
&= \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}}\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{1-\overline{\alpha}_{t}} \left( \dfrac{1}{\sqrt{\overline{\alpha}_{t}}} \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \dfrac{\sqrt{1-\overline{\alpha}_{t}}}{\sqrt{\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) \\
&= \left( \frac{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} + \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}} \right)\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\sqrt{\overline{\alpha}_{t-1}}(1 - \alpha_{t})\sqrt{1-\overline{\alpha}_{t}}}{(1-\overline{\alpha}_{t})\sqrt{\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \\
&= \left( \frac{\alpha_{t}(1-\overline{\alpha}_{t-1})}{\sqrt{\alpha_{t}}(1-\overline{\alpha}_{t})} + \frac{1 - \alpha_{t}}{(1-\overline{\alpha}_{t})\sqrt{\alpha_{t}}} \right)\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{1 - \alpha_{t}}{\sqrt{1-\overline{\alpha}_{t}}\sqrt{\alpha_{t}}}\boldsymbol{\epsilon} \\
&= \dfrac{1}{\sqrt{\alpha_{t}}}\left( \frac{ (\alpha_{t} - \overline{\alpha}_{t}) + (1 - \alpha_{t})}{1-\overline{\alpha}_{t}} \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) \\
&= \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right)
\end{align*}
$$

Thus, $L_{t-1}$ can be rewritten as:

$$ L_{t-1} = \mathbb{E}_{\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon})} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}), t) \right]^{2} \right] + C_{2} \tag{11} $$

Hence the learning target of $\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t)$ is $\dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \dfrac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right)$. Since $\mathbf{x}_{t}$ is given as the input and $\beta_{t}$ is a fixed constant, the only genuinely unknown quantity is the noise $\boldsymbol{\epsilon} = \boldsymbol{\epsilon}_{t}$, so it is natural to make the network predict $\boldsymbol{\epsilon}$ and let the parameters $\theta$ enter only through that prediction:

$$ \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}, t) = \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right) \tag{12} $$
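
In code, $(12)$ is just a reparameterization of the network output; a minimal sketch (assuming `eps_theta` is some noise-prediction network, here only a placeholder) looks as follows.

```python
import numpy as np

def mu_theta(x_t, t, eps_theta, alphas, alpha_bar):
    """Mean of p_theta(x_{t-1} | x_t) computed from a noise prediction, as in (12).

    eps_theta: callable (x_t, t) -> predicted noise, a stand-in for the trained network.
    alphas, alpha_bar: per-step arrays of alpha_t and \bar{alpha}_t (0-indexed in t).
    """
    beta_t = 1.0 - alphas[t]
    return (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps_theta(x_t, t)) / np.sqrt(alphas[t])
```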

This parameterization is also intuitively appealing: if the noise $\boldsymbol{\epsilon}_{t}$ that was added is known, it can simply be subtracted off to move back toward $\mathbf{x}_{0}$. The derivation above shows that this intuition is backed by a strictly mathematical argument.

With $p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t}) = \mathcal{N}( \boldsymbol{\mu}_{\theta}, \sigma_{t}^{2}\mathbf{I})$ and this mean vector, sampling $\mathbf{x}_{t-1} \sim p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})$ given $\mathbf{x}_{t}$ amounts to:

$$ \mathbf{x}_{t-1} = \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right) + \sigma_{t} \mathbf{z},\qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$
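
Iterating this step from $t = T$ down to $t = 1$ gives the sampling procedure (Algorithm 2 in the paper). The sketch below is a schematic version under the assumptions already used above: a hypothetical `eps_theta` network, $\sigma_{t}^{2} = \beta_{t}$, and no noise added at the last step.

```python
import numpy as np

def ddpm_sample(eps_theta, shape, betas, rng):
    """Draw one sample by running the reverse process from pure noise (a sketch)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = len(betas)

    x = rng.standard_normal(shape)                 # x_T ~ N(0, I)
    for t in reversed(range(T)):                   # t = T-1, ..., 0 (0-indexed)
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z           # sigma_t^2 = beta_t
    return x
```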

Finally, substituting $(12)$ into $(11)$ gives:

$$
\begin{align*}
& \mathbb{E}_{\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon})} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}) - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) - \boldsymbol{\mu}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0}, \boldsymbol{\epsilon}), t) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{t}} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} \right) - \dfrac{1}{\sqrt{\alpha_{t}}}\left( \mathbf{x}_{t} - \frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{t}} \left[ \dfrac{1}{2\sigma_{t}^{2}} \left[ \dfrac{1}{\sqrt{\alpha_{t}}}\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon} - \dfrac{1}{\sqrt{\alpha_{t}}}\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{t}} \left[ \dfrac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t} (1 - \overline{\alpha}_{t})} \left[ \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}, t) \right]^{2} \right] \\
&= \mathbb{E}_{\mathbf{x}_{0}, \boldsymbol{\epsilon}} \left[ \dfrac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t} (1 - \overline{\alpha}_{t})} \left[ \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_{t}}\boldsymbol{\epsilon}, t) \right]^{2} \right]
\end{align*}
$$

This leads directly to the training and sampling procedures given as pseudocode in the paper.

3.3 Data Scaling, Reverse Process Decoder, and $L_{0}$

The discussion so far has been in terms of continuous probability density functions, but image data are discrete values in $\left\{ 0, 1, \dots, 255 \right\}$, so appropriate scaling is needed. The values $\left\{ 0, 1, \dots, 255 \right\}$ are first linearly rescaled to $[-1, 1]$. The authors then define the final step of sampling, the decoder $p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1})$, as:

$$ p_{\theta}(\mathbf{x}_{0} | \mathbf{x}_{1}) = \prod\limits_{i = 1}^{D} \int\limits_{\delta_{-}(x_{0}^{i})}^{\delta_{+}(x_{0}^{i})} \mathcal{N}(x; \mu_{\theta}^{i}(\mathbf{x}_{1}, 1), \sigma_{1}^{2}) dx $$

$$ \delta_{+}(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}, \qquad \delta_{-}(x) = \begin{cases} x - \frac{1}{255} & \text{if } x > -1 \\ -\infty & \text{if } x = -1 \end{cases} $$

Here $x_{0}^{i}$ denotes the $i$th pixel of $\mathbf{x}_{0}$; a pixel value $x$ is interpreted as the interval $[x - \frac{1}{255}, x + \frac{1}{255}]$.
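
A sketch of this discretized Gaussian log-likelihood (my own helper, using the standard normal CDF; the clamping of the outermost bins follows the definition of $\delta_{\pm}$ above):

```python
import numpy as np
from math import erf

def std_normal_cdf(z):
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def discretized_gaussian_log_likelihood(x0, mu, sigma):
    """log p_theta(x_0 | x_1) for pixels x0 scaled to [-1, 1] with bin width 2/255."""
    x0, mu = np.asarray(x0, float), np.asarray(mu, float)
    upper = np.where(x0 >= 1.0, np.inf, x0 + 1.0 / 255.0)      # delta_plus
    lower = np.where(x0 <= -1.0, -np.inf, x0 - 1.0 / 255.0)    # delta_minus
    cdf_up = np.vectorize(std_normal_cdf)((upper - mu) / sigma)
    cdf_lo = np.vectorize(std_normal_cdf)((lower - mu) / sigma)
    prob = np.clip(cdf_up - cdf_lo, 1e-12, 1.0)                # avoid log(0)
    return np.sum(np.log(prob))                                # sum over the D pixels
```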

3.4 Simplified Training Objective

In Section 3.2 we derived $L_{t-1}$ so that the network predicts $\boldsymbol{\epsilon}$; in practice, the authors found that dropping the weighting factor in front works better, both for sample quality and for ease of implementation:

$$ L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_{0}, \boldsymbol{\epsilon}} \left[ \left( \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\sqrt{\overline{\alpha}_{t}}\mathbf{x}_{0} + \sqrt{1-\overline{\alpha}_{t}}\boldsymbol{\epsilon}, t) \right)^{2} \right] $$

As in the pseudocode, $t$ is drawn uniformly from $\left\{ 1, \dots, T \right\}$.
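
Putting the pieces together, one step of the simplified objective (Algorithm 1 in the paper) can be sketched as follows; `eps_theta` again stands for an arbitrary noise-prediction network, and the schedule is the hypothetical linear one used earlier.

```python
import numpy as np

def lsimple_loss(eps_theta, x0_batch, betas, rng):
    """Monte Carlo estimate of L_simple for one batch (a sketch, not the paper's code)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    B = x0_batch.shape[0]

    t = rng.integers(0, len(betas), size=B)                      # t ~ Uniform over steps (0-indexed here)
    eps = rng.standard_normal(x0_batch.shape)                    # eps ~ N(0, I)

    a_bar = alpha_bar[t].reshape(B, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1 - a_bar) * eps   # closed form (5)

    pred = eps_theta(x_t, t)                                     # network's per-example noise prediction
    return np.mean((eps - pred) ** 2)                            # squared error in eps
```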


  1. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” Advances in neural information processing systems 33 (2020): 6840-6851. ↩︎

  2. https://x.com/cyxacxa https://x.com/cyxacxa/status/1903757493987389938/photo/1 ↩︎

  3. Sohl-Dickstein, Jascha, et al. “Deep unsupervised learning using nonequilibrium thermodynamics.” International conference on machine learning. pmlr, 2015. ↩︎

  4. Feller, William. “Retracted chapter: On the theory of stochastic processes, with particular reference to applications.” Selected Papers I. Springer, Cham, 2015. 769-798. ↩︎