Paper Review: Score Matching
Overview
Score Matching is a statistical technique introduced in the 2005 paper by Aapo Hyvärinen, Estimation of Non-Normalized Statistical Models by Score Matching, which provides a way to estimate non-normalized models without ever computing the normalization constant.
1. Introduction
In many cases, probabilistic models are specified only up to a normalization constant $Z$. For instance, a probability density function $p_{\boldsymbol{\theta}}$ with parameters $\boldsymbol{\theta}$ is written as follows:
$$ p(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} q(\boldsymbol{\xi}; \boldsymbol{\theta}) $$
Here, the issue is that while $q$ is analytically well-defined or easy to compute, $Z(\boldsymbol{\theta}) = \int q(\boldsymbol{\xi}; \boldsymbol{\theta}) d \boldsymbol{\xi}$ is often intractable. In particular, when $\boldsymbol{\xi}$ is high-dimensional, the integral is practically impossible to evaluate due to the curse of dimensionality. Markov chain Monte Carlo methods have commonly been used to estimate non-normalized models, but they are slow, while other approximate methods often perform poorly.
2. Estimation by Score Matching
The crux of the proposed method is the score function. For a model probability density function $p(\boldsymbol{\xi}; \boldsymbol{\theta})$ that approximates the data distribution, the score function $\psi(\boldsymbol{\xi}; \boldsymbol{\theta})$ is defined as follows:
$$ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) = \begin{bmatrix} \psi_{1}(\boldsymbol{\xi}; \boldsymbol{\theta}) \\ \vdots \\ \psi_{n}(\boldsymbol{\xi}; \boldsymbol{\theta}) \end{bmatrix} := \begin{bmatrix} \dfrac{\partial \log p(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{1}} \\ \vdots \\ \dfrac{\partial \log p(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{n}} \end{bmatrix} = \nabla_{\boldsymbol{\xi}} \log p(\boldsymbol{\xi}; \boldsymbol{\theta}) $$
In other words, the score function is the gradient of the log-probability density function with respect to the data variable. Since $\log p(\boldsymbol{\xi}; \boldsymbol{\theta}) = \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) - \log Z(\boldsymbol{\theta})$ and $Z(\boldsymbol{\theta})$ does not depend on $\boldsymbol{\xi}$, the normalization constant vanishes under the gradient, so the score function of the model can be written as:
$$ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) $$
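To make the point concrete, below is a minimal numerical sketch (not from the paper) for a one-dimensional Gaussian: the score computed from the unnormalized density $q$ coincides with the score of the normalized density $p$, because the constant $Z$ drops out under differentiation with respect to the data variable. The function names and the finite-difference approximation are illustrative choices.

```python
import numpy as np

# 1D Gaussian with mean mu and standard deviation sigma
mu, sigma = 1.0, 2.0
Z = np.sqrt(2 * np.pi) * sigma                     # normalization constant

log_q = lambda x: -0.5 * ((x - mu) / sigma) ** 2   # unnormalized log-density
log_p = lambda x: log_q(x) - np.log(Z)             # normalized log-density

def score(log_density, x, eps=1e-5):
    """Central-difference approximation of d(log density)/dx."""
    return (log_density(x + eps) - log_density(x - eps)) / (2 * eps)

x = 0.3
print(score(log_q, x), score(log_p, x))   # identical up to numerical error
print(-(x - mu) / sigma**2)               # analytic score -(x - mu)/sigma^2
```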
Let us denote the score function of the true data distribution, with density $p_{\mathbf{x}}$, as follows:
$$ \psi_{\mathbf{x}}( \cdot ) = \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}( \cdot ) $$
The paper sets up the objective function as the expected squared distance between the model score and the data score:
$$ \begin{equation} \begin{aligned} J(\boldsymbol{\theta}) &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \\ &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) - \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \end{aligned} \end{equation} $$
Thus, Score Matching refers to a method for estimating $\boldsymbol{\theta}$ as follows:
$$ \hat{\boldsymbol{\theta}} = \argmin\limits_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) $$
However, inspecting $(1)$ reveals an issue: computing $\psi_{\mathbf{x}}(\boldsymbol{\xi}) = \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}(\boldsymbol{\xi})$ requires knowing $p_{\mathbf{x}}$, which is precisely the unknown density we are trying to approximate with the model $p(\boldsymbol{\xi}; \boldsymbol{\theta})$, so the objective appears circular. The following theorem shows that $(1)$ can in fact be rewritten without $\psi_{\mathbf{x}}$.
$\textbf{Theorem 1}$ Assume the score function $\psi(\boldsymbol{\xi}; \boldsymbol{\theta})$ of the model is differentiable, along with some weak regularity conditions. Then $(1)$ can be expressed as:
$$ \begin{align*} J(\boldsymbol{\theta}) &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \sum\limits_{i=1}^{n} \left[ \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})^{2} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \tag{2} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left[\sum\limits_{i=1}^{n} \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta})} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left[\Delta \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) \|_{2}^{2} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \end{align*} $$
Here, $\text{constant}$ is a constant that does not depend on $\boldsymbol{\theta}$. $\psi_{i}$ represents the $i$-th component of the score, and $\partial_{i} \psi_{i}$ is the partial derivative of the $i$-th component of the score function with respect to the $i$-th variable.
$$ \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{\partial \log q(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} $$ $$ \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} = \dfrac{\partial^{2} \log q(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}^{2}} $$
Refer to the Appendix for the proof. Summed over $i$, the first term inside the brackets is the Laplacian of the log-density $\log q$:
$$ \Delta \log q = \nabla^{2} \log q = \sum\limits_{i=1}^{n} \dfrac{\partial^{2} \log q}{\partial \xi_{i}^{2}} $$
Since only a finite amount of data is available in practice, given $T$ samples $\mathbf{x}(1), \dots, \mathbf{x}(T)$, the sample version of the objective is:
$$ \tilde{J}(\boldsymbol{\theta}) = \dfrac{1}{T} \sum\limits_{t=1}^{T} \sum\limits_{i=1}^{n} \left[ \partial_{i} \psi_{i}(\mathbf{x}(t); \boldsymbol{\theta}) + \dfrac{1}{2} \psi_{i}(\mathbf{x}(t); \boldsymbol{\theta})^{2} \right] + \text{constant} $$
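As an illustration of how $\tilde{J}$ might be evaluated for a generic unnormalized model, here is a small sketch (not code from the paper) that approximates $\psi_{i}$ and $\partial_{i} \psi_{i}$ by central differences; `log_q` is an assumed user-supplied function returning $\log q(\mathbf{x}; \boldsymbol{\theta})$ for a single sample.

```python
import numpy as np

def score_matching_objective(log_q, X, theta, eps=1e-4):
    """Sample objective J~(theta), up to the additive constant.

    log_q: callable (x, theta) -> scalar, the unnormalized log-density.
    X:     (T, n) array of samples.
    """
    T, n = X.shape
    total = 0.0
    for x in X:
        for i in range(n):
            e = np.zeros(n); e[i] = eps
            # psi_i = d log q / d xi_i           (central difference)
            psi_i = (log_q(x + e, theta) - log_q(x - e, theta)) / (2 * eps)
            # d_i psi_i = d^2 log q / d xi_i^2   (second central difference)
            d_psi_i = (log_q(x + e, theta) - 2 * log_q(x, theta)
                       + log_q(x - e, theta)) / eps**2
            total += d_psi_i + 0.5 * psi_i**2
    return total / T

# Toy usage with an unnormalized Gaussian model, theta = (mu, M):
def log_q_gauss(x, theta):
    mu, M = theta
    d = x - mu
    return -0.5 * d @ M @ d

X = np.random.default_rng(0).normal(size=(500, 2))
print(score_matching_objective(log_q_gauss, X, (np.zeros(2), np.eye(2))))
```

In practice one would minimize this objective over $\boldsymbol{\theta}$ with a gradient-based optimizer or, where possible, use the model's analytic derivatives instead of finite differences.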
The following theorem then shows that minimizing $(2)$ is indeed sufficient for estimating the model.
$\textbf{Theorem 2}$ Assume there exists a unique $\boldsymbol{\theta}^{\ast}$ such that $p_{\mathbf{x}}(\cdot) = p(\cdot; \boldsymbol{\theta}^{\ast})$ holds. Also assume $q(\boldsymbol{\xi}; \boldsymbol{\theta}) > 0$. Then the following holds:
$$ J(\boldsymbol{\theta}) = 0 \iff \boldsymbol{\theta} = \boldsymbol{\theta}^{\ast} $$
$\textbf{Corollary}$ Under the assumptions of the preceding theorems, the score matching estimator obtained by minimizing $\tilde{J}$ is consistent. That is, as the sample size grows without bound, the estimator converges in probability to the true value $\boldsymbol{\theta}^{\ast}$, provided the optimization algorithm finds the global minimum.
By the law of large numbers, $\tilde{J}$ converges to $J$ as the sample size increases; therefore the Corollary holds.
3. Examples
3.1 Multivariate Gaussian Density
Consider a very simple case with a multivariate normal distribution:
$$ p(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = \dfrac{1}{Z(\mathbf{M}, \boldsymbol{\mu} )} \exp \left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) \right) $$
Here, $\mathbf{M} \in \mathbb{R}^{n \times n}$ is the inverse of the covariance matrix, a symmetric positive definite matrix, and $\boldsymbol{\mu} \in \mathbb{R}^{n}$ is the mean vector. Although $Z(\mathbf{M}, \boldsymbol{\mu}) = \left( (2\pi)^{n} / \det \mathbf{M} \right)^{1/2}$ is well known in this case, let us treat it as a simple illustrative example.
3.1.1 Estimation
In this case, $q$, $\psi$, and $\partial_{i} \psi_{i}$ are as follows:
$$ q(\mathbf{x}) = \exp \left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) \right) $$
For a symmetric matrix $\mathbf{R}$, the gradient of the quadratic form is $\nabla_{\mathbf{x}} (\mathbf{x}^{\mathsf{T}} \mathbf{R} \mathbf{x}) = 2 \mathbf{R} \mathbf{x}$, thus:
$$ \psi(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = -\mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) = - \begin{bmatrix} \sum\limits_{j}m_{1j}(x_{j}-\mu_{j}) \\ \vdots \\[1em] \sum\limits_{j}m_{nj}(x_{j}-\mu_{j})\end{bmatrix} $$
Here, $\mathbf{M} = [m_{ij}]$. $\partial_{i} \psi_{i} = \dfrac{\partial \psi_{i}}{\partial x_{i}}$ is as follows:
$$ \partial_{i} \psi_{i}(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = -m_{ii} $$
Therefore, $\tilde{J}$ is as follows, using $\sum_{i} \psi_{i}^{2} = \braket{\psi, \psi} = \psi^{\mathsf{T}} \psi$ and dropping the constant term:
$$ \begin{align*} \tilde{J}(\mathbf{M}, \boldsymbol{\mu}) &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ \sum\limits_{i} -m_{ii} + \dfrac{1}{2}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \mathbf{x}(t) - \boldsymbol{\mu} \right) \right] \\ &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ - \Tr (\mathbf{M}) + \dfrac{1}{2}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \mathbf{x}(t) - \boldsymbol{\mu} \right) \right] \end{align*} $$
Here, $\Tr$ denotes the trace. To find the $\boldsymbol{\mu}$ that minimizes the above expression, we compute the gradient; by the gradient formula for a quadratic form:
$$ \begin{align*} \nabla_{\boldsymbol{\mu}} \tilde{J} &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \boldsymbol{\mu} - \mathbf{x}(t) \right) \right] \\ &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{M}^{\mathsf{T}} \mathbf{M} \boldsymbol{\mu} - \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{M}^{\mathsf{T}} \mathbf{M} \mathbf{x}(t) \\ &= \mathbf{M}^{\mathsf{T}} \mathbf{M} \boldsymbol{\mu} - \mathbf{M}^{\mathsf{T}} \mathbf{M} \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{x}(t) \\ \end{align*} $$
Since $\mathbf{M}^{\mathsf{T}} \mathbf{M}$ is invertible, the $\boldsymbol{\mu}$ that satisfies $\nabla_{\boldsymbol{\mu}} \tilde{J} = \mathbf{0}$ is the sample mean.
$$ \boldsymbol{\mu}^{\ast} = \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{x}(t) $$
Matrix Derivative of a Scalar Function:
$$ \nabla_{\mathbf{X}} (\Tr \mathbf{X}) = I $$
$$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{a}) = 2\mathbf{X}\mathbf{a}\mathbf{a}^{\mathsf{T}} $$
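As a quick sanity check (an illustration, not part of the paper), the second formula can be verified numerically with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 3))
a = rng.normal(size=3)
eps = 1e-6

f = lambda X: a @ X.T @ X @ a                 # scalar function a^T X^T X a
grad_numeric = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X); E[i, j] = eps
        grad_numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(grad_numeric, 2 * X @ np.outer(a, a), atol=1e-4))   # True
```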
Applying these formulas to compute $\nabla_{\mathbf{M}} \tilde{J}$ (this matches the result in the paper, written in a slightly simplified form):
$$ \begin{align*} \nabla_{\mathbf{M}} \tilde{J} &= \dfrac{1}{T} \sum\limits_{t=1}^{T} [-I + \mathbf{M}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}] \\ &= -I + \mathbf{M} \dfrac{1}{T}\sum\limits_{t=1}^{T}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \end{align*} $$
For the above expression to be $\mathbf{0}$, $\mathbf{M}$ must be the inverse of the sample covariance matrix $\dfrac{1}{T}\sum\limits_{t=1}^{T}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}$.
As this result shows, for the Gaussian case Score Matching yields an estimator identical to the maximum likelihood estimator.
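The following short numerical sketch (not code from the paper) checks this: for Gaussian data, the score matching estimates derived above, the sample mean and the inverse of the sample covariance matrix, match the familiar maximum likelihood estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=5000)   # (T, n) samples

mu_hat = X.mean(axis=0)                          # score matching / ML mean
centered = X - mu_hat
cov_hat = centered.T @ centered / X.shape[0]     # sample covariance (1/T convention)
M_hat = np.linalg.inv(cov_hat)                   # estimated precision matrix

print(mu_hat)                   # close to true_mu
print(M_hat)                    # close to inv(true_cov)
print(np.linalg.inv(true_cov))
```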
Appendix A. Proof of Theorem 1
Expanding the norm $\| \cdot \|^{2} = \Braket{\cdot, \cdot}$ of $(1)$ gives the following:
$$ \begin{align*} J(\boldsymbol{\theta}) &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \\ &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi})} \mathrm{d}\boldsymbol{\xi} \\ &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) } + \Braket{\psi_{\mathbf{x}}(\boldsymbol{\xi}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} - 2\Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \dfrac{1}{2}\| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} - \Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \end{align*} $$
Examining the integral of only the third term, we have:
$$ -\sum\limits_{i} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{\mathbf{x}, i}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} $$
For each $i$, the integral can be rewritten as follows, applying the derivative rule of the logarithm:
$$ \begin{align*} -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{\mathbf{x}, i}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} &= -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \log p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ &= -\int p_{\mathbf{x}}(\boldsymbol{\xi})\left( \dfrac{1}{p_{\mathbf{x}}(\boldsymbol{\xi})} \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \right)\psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ &= -\int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ \end{align*} $$
This can be further rewritten using integration by parts:
$$ \begin{align*} & -\int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ &= -\int \left( \int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\xi_{i} \right) \mathrm{d}(\xi_{1},\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_{n}) \\ &= -\int \left( \left[p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \right]_{\xi_{i}=-\infty}^{\infty} - \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\xi_{i} \right) \mathrm{d}(\xi_{1},\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_{n}) \\ &= \int \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\xi_{i}\, \mathrm{d}(\xi_{1},\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_{n}) \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\boldsymbol{\xi} = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \end{align*} $$
Here, the boundary term in brackets vanishes by the regularity assumption that $p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \to 0$ as $\xi_{i} \to \pm \infty$ (integrability of $p_{\mathbf{x}}$ alone does not guarantee this, which is why Theorem 1 requires weak regularity conditions). Substituting and rearranging, we get:
$$ \begin{align*} J(\boldsymbol{\theta}) &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \dfrac{1}{2}\| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} - \Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \sum\limits_{i}\partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \Big] \mathrm{d}\boldsymbol{\xi} + \dfrac{1}{2}\int p_{\mathbf{x}}(\boldsymbol{\xi}) \| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} \mathrm{d}\boldsymbol{\xi} \end{align*} $$
The last term is a constant not depending on $\boldsymbol{\theta}$. Hence, we obtain:
$$ J(\boldsymbol{\theta}) = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \sum\limits_{i}\Big[ \frac{1}{2} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})^{2} + \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \Big] \mathrm{d}\boldsymbol{\xi} + \text{constant} $$
■
Appendix B. Proof of Theorem 2
$$ J(\boldsymbol{\theta}) = 0 \iff \boldsymbol{\theta} = \boldsymbol{\theta}^{\ast} $$
($\implies$)
Assume $J(\boldsymbol{\theta}) = 0$.
$$ J = \dfrac{1}{2}\int p_{\mathbf{x}}(\boldsymbol{\xi}) \| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \|^{2} \mathrm{d}\boldsymbol{\xi} = \dfrac{1}{2}\left\langle p_{\mathbf{x}}, \| \psi - \psi_{\mathbf{x}} \|^{2} \right\rangle = 0 $$
Since $q \gt 0$, we have $p(\boldsymbol{\xi}; \boldsymbol{\theta}) \gt 0$ for all $\boldsymbol{\xi}$, and in particular $p_{\mathbf{x}}(\boldsymbol{\xi}) = p(\boldsymbol{\xi}; \boldsymbol{\theta}^{\ast}) \gt 0$. Because the nonnegative integrand $\| \psi - \psi_{\mathbf{x}} \|^{2} \ge 0$ is weighted by a strictly positive density, the integral can vanish only if $\| \psi - \psi_{\mathbf{x}} \|^{2} = 0$ for all $\boldsymbol{\xi}$.
$$ J = 0 = \left\langle p_{\mathbf{x}}, \| \psi - \psi_{\mathbf{x}} \|^{2} \right\rangle \implies \| \psi - \psi_{\mathbf{x}} \|^{2} = 0 \implies \psi = \psi_{\mathbf{x}} $$
This implies the following:
$$ \psi_{\mathbf{x}} = \psi \implies \log p_{\mathbf{x}}(\cdot) = \log p( \cdot; \boldsymbol{\theta}) + \text{constant} $$
Since both $p_{\mathbf{x}}$ and $p(\cdot; \boldsymbol{\theta})$ are probability density functions that integrate to $1$, the constant must be $0$. Hence $p_{\mathbf{x}}(\cdot) = p(\cdot; \boldsymbol{\theta})$, and by the uniqueness assumption, $\boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}$.
■
($\impliedby$)
If $\boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}$, then $\psi(\cdot; \boldsymbol{\theta}) = \psi_{\mathbf{x}}(\cdot)$, so the integrand of $(1)$ is identically zero and $J(\boldsymbol{\theta}) = 0$.
■