
Paper Review: Score Matching

Overview

Score Matching is a statistical technique introduced in Aapo Hyvärinen's 2005 paper, Estimation of Non-Normalized Statistical Models by Score Matching, which provides a method for estimating non-normalized models without computing the normalization constant.

1. Introduction

In many cases, probabilistic models are given as non-normalized models involving a normalization constant $Z$. For instance, a probability density function $p_{\boldsymbol{\theta}}$ with parameters $\boldsymbol{\theta}$ is defined as follows:

$$ p(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} q(\boldsymbol{\xi}; \boldsymbol{\theta}) $$

Here, the issue is that while $q$ is analytically well-defined or easy to compute, $Z(\boldsymbol{\theta}) = \int q(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi}$ is often difficult to calculate. In particular, when $\boldsymbol{\xi}$ is a high-dimensional vector, the integral is practically impossible to compute due to the curse of dimensionality. Previously, methods such as Markov chain Monte Carlo were commonly used for estimating non-normalized models, but they are slow, and alternative methods often underperform.

2. Estimation by Score Matching

The crux of the proposed method is the score function. Denote the score function of the model's probability density function $p(\boldsymbol{\xi}; \boldsymbol{\theta})$, which approximates the data distribution, by $\psi(\boldsymbol{\xi}; \boldsymbol{\theta})$; it is defined as follows:

$$ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) = \begin{bmatrix} \psi_{1}(\boldsymbol{\xi}; \boldsymbol{\theta}) \\ \vdots \\ \psi_{n}(\boldsymbol{\xi}; \boldsymbol{\theta}) \end{bmatrix} := \begin{bmatrix} \dfrac{\partial \log p(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{1}} \\ \vdots \\ \dfrac{\partial \log p(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{n}} \end{bmatrix} = \nabla_{\boldsymbol{\xi}} \log p(\boldsymbol{\xi}; \boldsymbol{\theta}) $$

In other words, the score function is the gradient of the log-probability density function with respect to $\boldsymbol{\xi}$. Since $Z(\boldsymbol{\theta})$ does not depend on $\boldsymbol{\xi}$, it vanishes under this gradient, so the score function can be written using only $q$:

$$ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) $$
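
To make this concrete, here is a minimal numerical sketch (not from the paper; the 1D Gaussian density and all function names are illustrative choices) showing that the score computed from the unnormalized $q$ and from the normalized $p$ coincide, precisely because the normalization constant does not depend on $\xi$:

```python
import numpy as np

# Minimal sketch (illustrative, not from the paper): for a 1D Gaussian,
# the score obtained from the unnormalized log q and from the normalized
# log p are identical, since log Z does not depend on xi.

def log_q(xi, mu=1.0, sigma=2.0):
    # unnormalized log-density: -(xi - mu)^2 / (2 sigma^2)
    return -((xi - mu) ** 2) / (2.0 * sigma ** 2)

def log_p(xi, mu=1.0, sigma=2.0):
    # normalized log-density: log q - log Z
    return log_q(xi, mu, sigma) - 0.5 * np.log(2.0 * np.pi * sigma ** 2)

def numerical_score(log_density, xi, eps=1e-5):
    # central finite difference of d/dxi log density
    return (log_density(xi + eps) - log_density(xi - eps)) / (2.0 * eps)

xi = 0.3
print(numerical_score(log_q, xi))  # score from q
print(numerical_score(log_p, xi))  # same value from p: (mu - xi) / sigma^2
```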

Similarly, denote the score function of the true distribution $p_{\mathbf{x}}$ of the observed data $\mathbf{x}$ as follows:

$$ \psi_{\mathbf{x}}( \cdot ) = \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}( \cdot ) $$

The paper sets up the objective function as the expected squared distance between the model's score and the data's score:

$$ \begin{align*} J(\boldsymbol{\theta}) &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \\ &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) - \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \tag{1} \end{align*} $$

Thus, Score Matching refers to the method of estimating $\boldsymbol{\theta}$ as follows:

$$ \hat{\boldsymbol{\theta}} = \argmin\limits_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) $$

However, inspecting $(1)$ reveals an issue: calculating $\psi_{\mathbf{x}}(\boldsymbol{\xi}) = \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}(\boldsymbol{\xi})$ requires knowing $p_{\mathbf{x}}$. Since $p_{\mathbf{x}}$ is unknown, we want to approximate it with the model $p(\boldsymbol{\xi}; \boldsymbol{\theta})$, but this looks circular, because evaluating the objective itself seems to require $p_{\mathbf{x}}$. In fact, by the theorem below, $(1)$ can be rewritten without $\psi_{\mathbf{x}}$.


**Theorem 1** Assume that the score function $\psi(\boldsymbol{\xi}; \boldsymbol{\theta})$ of the model is differentiable. Then $(1)$ can be expressed as:

$$ \begin{align*} J(\boldsymbol{\theta}) &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \sum\limits_{i=1}^{n} \left[ \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})^{2} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \tag{2} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left[\sum\limits_{i=1}^{n} \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta})} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left[\Delta \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) \|_{2}^{2} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \end{align*} $$

Here, $\text{constant}$ is a constant that does not depend on $\boldsymbol{\theta}$. $\psi_{i}$ is the $i$-th component of the score, and $\partial_{i} \psi_{i}$ is its partial derivative with respect to the $i$-th variable.

$$ \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{\partial \log q(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} $$

$$ \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} = \dfrac{\partial^{2} \log q(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}^{2}} $$


Refer to Appendix A for the proof. Summed over $i$, the first term inside the brackets is exactly the Laplacian of $\log q$:

$$ \Delta \log q = \nabla^{2} \log q = \sum\limits_{i=1}^{n} \dfrac{\partial^{2} \log q}{\partial \xi_{i}^{2}} $$

Since only a finite amount of data is available in practice, given $T$ samples $\mathbf{x}(1), \dots, \mathbf{x}(T)$, the sample version of the objective is as follows:

$$ \tilde{J}(\boldsymbol{\theta}) = \dfrac{1}{T} \sum\limits_{t=1}^{T} \sum\limits_{i=1}^{n} \left[ \partial_{i} \psi_{i}(\mathbf{x}(t); \boldsymbol{\theta}) + \dfrac{1}{2} \psi_{i}(\mathbf{x}(t); \boldsymbol{\theta})^{2} \right] + \text{constant} $$
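
As a small illustration of how $\tilde{J}$ is used (a toy model chosen for this review, not an example from the paper), take the zero-mean Gaussian model $q(x; \lambda) = \exp(-\lambda x^{2}/2)$ with unknown precision $\lambda$. Then $\psi(x; \lambda) = -\lambda x$ and $\partial \psi / \partial x = -\lambda$, so the sample objective (with the constant dropped) is $\tilde{J}(\lambda) = \frac{1}{T}\sum_{t}\left[-\lambda + \frac{1}{2}\lambda^{2} x(t)^{2}\right]$ and can be minimized in closed form:

```python
import numpy as np

# Toy model (illustrative): q(x; lam) = exp(-lam * x^2 / 2), so
#   psi(x; lam)   = -lam * x
#   d psi / d x   = -lam
# and the sample objective (constant dropped) is
#   J_tilde(lam)  = mean(-lam + 0.5 * lam**2 * x**2).

def j_tilde(lam, x):
    return np.mean(-lam + 0.5 * lam ** 2 * x ** 2)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.5, size=10_000)  # true precision = 1 / 1.5**2

# Setting dJ/dlam = -1 + lam * mean(x^2) = 0 gives the closed-form minimizer.
lam_hat = 1.0 / np.mean(x ** 2)
print(lam_hat, 1.0 / 1.5 ** 2)                          # estimate vs. true precision
print(j_tilde(lam_hat, x) < j_tilde(0.9 * lam_hat, x))  # sanity check: True
```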

The following theorem then shows that minimizing $(2)$ is indeed sufficient to estimate the model.


**Theorem 2** Assume that there exists a unique $\boldsymbol{\theta}^{\ast}$ such that $p_{\mathbf{x}}(\cdot) = p(\cdot; \boldsymbol{\theta}^{\ast})$ holds. Also assume $q(\boldsymbol{\xi}; \boldsymbol{\theta}) > 0$. Then the following holds:

$$ J(\boldsymbol{\theta}) = 0 \iff \boldsymbol{\theta} = \boldsymbol{\theta}^{\ast} $$


**Corollary** Under the assumptions of the preceding theorems, the score matching estimator obtained by minimizing $\tilde{J}$ is consistent. That is, as the sample size grows without bound, the estimator converges in probability to the true value $\boldsymbol{\theta}^{\ast}$, assuming the optimization algorithm finds the global minimum.


As the sample size increases, $\tilde{J}$ converges to $J$ by the law of large numbers; therefore the corollary holds.

3. Examples

3.1 Multivariate Gaussian Density

Consider a very simple case with a multivariate normal distribution:

$$ p(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = \dfrac{1}{Z(\mathbf{M}, \boldsymbol{\mu} )} \exp \left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) \right) $$

Here, $\mathbf{M} \in \mathbb{R}^{n \times n}$ is the inverse of the covariance matrix, assumed symmetric and positive definite, and $\boldsymbol{\mu} \in \mathbb{R}^{n}$ is the mean vector. Although $Z(\mathbf{M}, \boldsymbol{\mu}) = \left((2\pi)^{n} / \det \mathbf{M}\right)^{1/2}$ is well known in this case, let us treat it as a simple example.
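
As a quick sanity check on this normalization constant (a brute-force numerical sketch for $n = 2$; the specific $\mathbf{M}$ is arbitrary), one can compare a grid approximation of $\int q \, \mathrm{d}\mathbf{x}$ with the closed form $\left((2\pi)^{n} / \det \mathbf{M}\right)^{1/2}$:

```python
import numpy as np

# Brute-force check of Z for a 2D Gaussian parameterized by its precision M:
#   Z = int exp(-0.5 * d^T M d) dd,  d = x - mu  (the integral is shift-invariant)

M = np.array([[2.0, 0.3],
              [0.3, 1.0]])             # symmetric positive definite precision
n = 2

grid = np.linspace(-10.0, 10.0, 2001)  # grid for each component of d = x - mu
dx = grid[1] - grid[0]
d1, d2 = np.meshgrid(grid, grid, indexing="ij")
quad = M[0, 0] * d1**2 + 2.0 * M[0, 1] * d1 * d2 + M[1, 1] * d2**2

Z_numeric = np.sum(np.exp(-0.5 * quad)) * dx * dx
Z_formula = np.sqrt((2.0 * np.pi) ** n / np.linalg.det(M))
print(Z_numeric, Z_formula)            # the two values agree closely
```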

3.1.1 Estimation

In this case, $q$, $\psi$, and $\partial_{i} \psi_{i}$ are expressed as follows:

$$ q(\mathbf{x}) = \exp \left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) \right) $$

For a symmetric matrix $\mathbf{R}$, the gradient of the quadratic form is $\nabla_{\mathbf{x}} (\mathbf{x}^{\mathsf{T}} \mathbf{R} \mathbf{x}) = 2 \mathbf{R} \mathbf{x}$, thus:

$$ \psi(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = -\mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) = - \begin{bmatrix} \sum\limits_{j}m_{1j}(x_{j}-\mu_{j}) \\ \vdots \\[1em] \sum\limits_{j}m_{nj}(x_{j}-\mu_{j})\end{bmatrix} $$

Here, $\mathbf{M} = [m_{ij}]$. The derivative $\partial_{i} \psi_{i} = \dfrac{\partial \psi_{i}}{\partial x_{i}}$ is:

$$ \partial_{i} \psi_{i}(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = -m_{ii} $$
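
The following sketch (illustrative; the random $\mathbf{M}$, $\boldsymbol{\mu}$, and tolerances are arbitrary choices) checks the analytic score $-\mathbf{M}(\mathbf{x} - \boldsymbol{\mu})$ and the diagonal derivatives $\partial_{i}\psi_{i} = -m_{ii}$ against finite differences of $\log q$:

```python
import numpy as np

# Check the analytic score psi(x) = -M (x - mu) and the derivatives
# d psi_i / d x_i = -m_ii against finite differences of log q.

def log_q(x, M, mu):
    d = x - mu
    return -0.5 * d @ M @ d

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
M = A @ A.T + n * np.eye(n)     # symmetric positive definite precision matrix
mu = rng.normal(size=n)
x = rng.normal(size=n)
eps = 1e-5
basis = np.eye(n)

def psi(x):
    return -M @ (x - mu)        # analytic score of the Gaussian model

# score: finite differences of log q match -M (x - mu)
numeric_score = np.array([
    (log_q(x + eps * e, M, mu) - log_q(x - eps * e, M, mu)) / (2 * eps)
    for e in basis
])
print(np.allclose(numeric_score, psi(x), atol=1e-6))

# diagonal derivatives: finite differences of psi_i match -m_ii
numeric_diag = np.array([
    (psi(x + eps * e)[i] - psi(x - eps * e)[i]) / (2 * eps)
    for i, e in enumerate(basis)
])
print(np.allclose(numeric_diag, -np.diag(M), atol=1e-6))
```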

Therefore, $\tilde{J}$ is as follows, using $\sum_{i} \psi_{i}^{2} = \braket{\psi, \psi} = \psi^{\mathsf{T}} \psi$:

$$ \begin{align*} \tilde{J}(\mathbf{M}, \boldsymbol{\mu}) &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ \sum\limits_{i} -m_{ii} + \dfrac{1}{2}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \mathbf{x}(t) - \boldsymbol{\mu} \right) \right] \\ &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ - \Tr (\mathbf{M}) + \dfrac{1}{2}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \mathbf{x}(t) - \boldsymbol{\mu} \right) \right] \end{align*} $$

Here, $\Tr$ denotes the trace. To find the $\boldsymbol{\mu}$ that minimizes this expression, we compute the gradient using the gradient formula for a quadratic form:

$$ \begin{align*} \nabla_{\boldsymbol{\mu}} \tilde{J} &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \boldsymbol{\mu} - \mathbf{x}(t) \right) \right] \\ &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{M}^{\mathsf{T}} \mathbf{M} \boldsymbol{\mu} - \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{M}^{\mathsf{T}} \mathbf{M} \mathbf{x}(t) \\ &= \mathbf{M}^{\mathsf{T}} \mathbf{M} \boldsymbol{\mu} - \mathbf{M}^{\mathsf{T}} \mathbf{M} \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{x}(t) \end{align*} $$

Since $\mathbf{M}^{\mathsf{T}} \mathbf{M}$ is invertible, the $\boldsymbol{\mu}$ that satisfies $\nabla_{\boldsymbol{\mu}} \tilde{J} = \mathbf{0}$ is the sample mean:

$$ \boldsymbol{\mu}^{\ast} = \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{x}(t) $$

Matrix Derivative of a Scalar Function:

$$ \nabla_{\mathbf{X}} (\Tr \mathbf{X}) = I $$

$$ \nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{a}) = 2\mathbf{X}\mathbf{a}\mathbf{a}^{\mathsf{T}} $$

Using these formulas, $\nabla_{\mathbf{M}} \tilde{J}$ is computed as follows (this matches the result in the paper, written here in a slightly more compact form):

$$ \begin{align*} \nabla_{\mathbf{M}} \tilde{J} &= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[-I + \mathbf{M}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}\right] \\ &= -I + \mathbf{M} \dfrac{1}{T}\sum\limits_{t=1}^{T}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \end{align*} $$

For this gradient to be $\mathbf{0}$, $\mathbf{M}$ must be the inverse of the sample covariance matrix $\dfrac{1}{T}\sum\limits_{t=1}^{T}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}$.
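
To tie the example together, here is a numerical sketch (synthetic data; the particular covariance, sample size, and tolerances are arbitrary choices) verifying that the sample mean and the inverse of the sample covariance make both gradients vanish:

```python
import numpy as np

# Verify the closed-form score matching estimates for the Gaussian model:
# mu_hat = sample mean, M_hat = inverse of the sample covariance
# (1/T) * sum_t (x(t) - mu_hat)(x(t) - mu_hat)^T. At these values the
# gradients of J_tilde with respect to mu and M are zero.

rng = np.random.default_rng(2)
n, T = 3, 50_000
true_cov = np.array([[2.0, 0.5, 0.0],
                     [0.5, 1.0, 0.3],
                     [0.0, 0.3, 1.5]])
true_mu = np.array([1.0, -2.0, 0.5])
X = rng.multivariate_normal(true_mu, true_cov, size=T)   # rows are x(t)

mu_hat = X.mean(axis=0)
D = X - mu_hat
sample_cov = D.T @ D / T
M_hat = np.linalg.inv(sample_cov)

# grad_mu J_tilde = M^T M (mu - sample mean): zero at mu = mu_hat
grad_mu = M_hat.T @ M_hat @ (mu_hat - X.mean(axis=0))

# grad_M J_tilde = -I + M * sample_cov: zero at M = M_hat
grad_M = -np.eye(n) + M_hat @ sample_cov

print(np.allclose(grad_mu, 0.0, atol=1e-8))
print(np.allclose(grad_M, 0.0, atol=1e-8))
print(np.allclose(M_hat, np.linalg.inv(true_cov), atol=0.1))  # close to truth
```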

As this result shows, Score Matching yields the same estimator as maximum likelihood estimation in this case.

Appendix A. Proof of Theorem 1

Expanding the norm $\| \cdot \|^{2} = \Braket{\cdot, \cdot}$ in $(1)$ gives the following:

$$ \begin{align*} J(\boldsymbol{\theta}) &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \\ &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi})} \mathrm{d}\boldsymbol{\xi} \\ &= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) } + \Braket{\psi_{\mathbf{x}}(\boldsymbol{\xi}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} - 2\Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \dfrac{1}{2}\| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} - \Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \end{align*} $$

Examining the integral of only the third term, we have:

$$ -\sum\limits_{i} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{\mathbf{x}, i}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} $$

For each $i$, the integral can be rewritten as follows by applying the derivative rule of the logarithm:

$$ \begin{align*} -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{\mathbf{x}, i}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} &= -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \log p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ &= -\int p_{\mathbf{x}}(\boldsymbol{\xi})\left( \dfrac{1}{p_{\mathbf{x}}(\boldsymbol{\xi})} \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \right)\psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ &= -\int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \end{align*} $$

This can be further rewritten using integration by parts (the inner integral is written over $\xi_{1}$, i.e. the case $i = 1$; general $i$ is handled in the same way with $\xi_{i}$):

$$ \begin{align*} & -\int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\ &= -\int \left( \int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\xi_{1} \right) \mathrm{d}(\xi_{2},\dots,\xi_{n}) \\ &= -\int \left( \left[p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \right]_{\xi_{1}=-\infty}^{\infty} - \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\xi_{1} \right) \mathrm{d}(\xi_{2},\dots,\xi_{n}) \\ &= \int \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\xi_{1}\mathrm{d}(\xi_{2},\dots,\xi_{n}) \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\boldsymbol{\xi} = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \end{align*} $$

Here, since $p_{\mathbf{x}}$ is a probability density function it must be integrable and hence decay at infinity; the paper assumes, as a regularity condition, that $p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \to 0$ as $\|\boldsymbol{\xi}\| \to \infty$, so the boundary term in brackets is $0$. Substituting and rearranging, we get:

$$ \begin{align*} J(\boldsymbol{\theta}) &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \dfrac{1}{2}\| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} - \Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \\ &= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \sum\limits_{i}\partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \Big] \mathrm{d}\boldsymbol{\xi} + \dfrac{1}{2}\int p_{\mathbf{x}}(\boldsymbol{\xi}) \| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} \mathrm{d}\boldsymbol{\xi} \end{align*} $$

The last term is a constant that does not depend on $\boldsymbol{\theta}$. Hence, we obtain:

$$ J(\boldsymbol{\theta}) = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \sum\limits_{i}\Big[ \frac{1}{2} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})^{2} + \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \Big] \mathrm{d}\boldsymbol{\xi} + \text{constant} $$

Appendix B. Proof of Theorem 2

$$ J(\boldsymbol{\theta}) = 0 \iff \boldsymbol{\theta} = \boldsymbol{\theta}^{\ast} $$

($\implies$)

Assume $J(\boldsymbol{\theta}) = 0$.

$$ J = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \|^{2} \mathrm{d}\boldsymbol{\xi} = 0 = \left\langle p_{\mathbf{x}}, \| \psi - \psi_{\mathbf{x}} \|^{2} \right\rangle $$

Since $q > 0$ for all $\boldsymbol{\xi}$, the density $p_{\mathbf{x}}(\cdot) = p(\cdot; \boldsymbol{\theta}^{\ast})$ is strictly positive everywhere, and since $\| \psi - \psi_{\mathbf{x}} \|^{2} \ge 0$, the integral can be zero only if $\| \psi - \psi_{\mathbf{x}} \|^{2} = 0$ for all $\boldsymbol{\xi}$.

$$ J = 0 = \left\langle p_{\mathbf{x}}, \| \psi - \psi_{\mathbf{x}} \|^{2} \right\rangle \implies \| \psi - \psi_{\mathbf{x}} \|^{2} = 0 \implies \psi = \psi_{\mathbf{x}} $$

This implies the following:

$$ \psi_{\mathbf{x}} = \psi \implies \log p_{\mathbf{x}}(\cdot) = \log p( \cdot; \boldsymbol{\theta}) + \text{constant} $$

Since both $p$ and $p_{\mathbf{x}}$ are probability density functions that integrate to $1$, the constant must be $0$. Hence $p_{\mathbf{x}}(\cdot) = p( \cdot; \boldsymbol{\theta})$, and by the uniqueness assumption, $\boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}$.

($\impliedby$)

If $\boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}$, then $\psi(\cdot; \boldsymbol{\theta}) = \psi_{\mathbf{x}}(\cdot)$, so the integrand in $(1)$ vanishes and $J(\boldsymbol{\theta}^{\ast}) = 0$.