Paper Review: Score Matching
Overview

Score Matching is a statistical technique introduced in the 2005 paper by Aapo Hyvärinen, Estimation of Non-Normalized Statistical Models by Score Matching, which provides a method for estimating non-normalized statistical models without computing the normalization constant.
1. Introduction

In many cases, probabilistic models are given only in non-normalized form, up to a normalization constant $Z$. For instance, a probability density function $p_{\boldsymbol{\theta}}$ with parameters $\boldsymbol{\theta}$ is defined as follows:
p(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{1}{Z(\boldsymbol{\theta})} q(\boldsymbol{\xi}; \boldsymbol{\theta})
Here, the issue is that while $q$ is analytically well-defined or easy to compute, $Z(\boldsymbol{\theta}) = \int q(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi}$ is often difficult to calculate. In particular, when the data vector $\boldsymbol{\xi}$ is high-dimensional, the integral becomes practically impossible to compute due to the curse of dimensionality. Previously, methods such as Markov chain Monte Carlo were commonly used to estimate non-normalized models, but they are slow, and other approaches often underperform.
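As a minimal one-dimensional illustration (my own example, not one from the paper), consider the unnormalized model $q(\xi; \theta) = \exp(-\theta \xi^{2} / 2)$ with $\theta > 0$. Here the normalization constant happens to have a closed form,

Z(\theta) = \int_{-\infty}^{\infty} \exp\left( -\dfrac{\theta \xi^{2}}{2} \right) \mathrm{d}\xi = \sqrt{\dfrac{2\pi}{\theta}},

but for richer models in higher dimensions no such formula is available, which is exactly the situation score matching targets.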
2. Estimation by Score Matching

The crux of the proposed method is the score function. Denoting by $\psi(\boldsymbol{\xi}; \boldsymbol{\theta})$ the score function of the model density $p(\boldsymbol{\xi}; \boldsymbol{\theta})$ that approximates the data distribution, it is defined as follows:
\psi(\boldsymbol{\xi}; \boldsymbol{\theta})
= \begin{bmatrix} \psi_{1}(\boldsymbol{\xi}; \boldsymbol{\theta}) \\ \vdots \\ \psi_{n}(\boldsymbol{\xi}; \boldsymbol{\theta}) \end{bmatrix}
:= \begin{bmatrix} \dfrac{\partial \log p(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{1}} \\ \vdots \\ \dfrac{\partial \log p(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{n}} \end{bmatrix}
= \nabla_{\boldsymbol{\xi}} \log p(\boldsymbol{\xi}; \boldsymbol{\theta})
In other words, the score function is the gradient of the log-density with respect to the data variable. The key observation in the paper is that the normalization constant can be ignored, so the model score is redefined as follows:
\psi(\boldsymbol{\xi}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta})
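Spelled out, this works because $Z(\boldsymbol{\theta})$ does not depend on $\boldsymbol{\xi}$ and therefore drops out when the gradient is taken with respect to $\boldsymbol{\xi}$:

\nabla_{\boldsymbol{\xi}} \log p(\boldsymbol{\xi}; \boldsymbol{\theta}) = \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) - \nabla_{\boldsymbol{\xi}} \log Z(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta})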
Let us denote the score function of the actual data distribution of $\mathbf{x}$ as follows:
\psi_{\mathbf{x}}( \cdot ) = \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}( \cdot )
The paper sets up the objective function as the expected squared difference between the model score and the data score:
\begin{equation}
\begin{aligned}
J(\boldsymbol{\theta})
&= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \\
&= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \nabla_{\boldsymbol{\xi}} \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) - \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi}
\end{aligned}
\end{equation}
Thus, Score Matching refers to the method of estimating $\boldsymbol{\theta}$ as follows:
\hat{\boldsymbol{\theta}} = \argmin\limits_{\boldsymbol{\theta}} J(\boldsymbol{\theta})
However, inspecting $(1)$ reveals an issue: calculating $\psi_{\mathbf{x}}(\boldsymbol{\xi}) = \nabla_{\boldsymbol{\xi}} \log p_{\mathbf{x}}(\boldsymbol{\xi})$ requires knowing $p_{\mathbf{x}}$. Since $p_{\mathbf{x}}$ is unknown, we would have to approximate it with the model $p(\boldsymbol{\xi}; \boldsymbol{\theta})$, yet that approximation is exactly what we are trying to obtain, which appears circular. In fact, by the theorem below, $(1)$ can be rewritten without $\psi_{\mathbf{x}}$.
Theorem 1. Assume the model score function $\psi(\boldsymbol{\xi}; \boldsymbol{\theta})$ is differentiable, along with some weak regularity conditions. Then $(1)$ can be expressed as:
\begin{align*}
J(\boldsymbol{\theta})
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \sum\limits_{i=1}^{n} \left[ \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})^{2} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \tag{2} \\
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left[\sum\limits_{i=1}^{n} \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta})} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant} \\
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left[\Delta \log q(\boldsymbol{\xi}; \boldsymbol{\theta}) + \dfrac{1}{2} \| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) \|_{2}^{2} \right] \mathrm{d}\boldsymbol{\xi} + \text{constant}
\end{align*}
Here, $\text{constant}$ is a term that does not depend on $\boldsymbol{\theta}$, $\psi_{i}$ denotes the $i$-th component of the score, and $\partial_{i} \psi_{i}$ is the partial derivative of the $i$-th component of the score function with respect to the $i$-th variable:
\psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{\partial \log q(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}}
\partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) = \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} = \dfrac{\partial^{2} \log q(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}^{2}}
Refer to the Appendix for the proof. When summed over $i$, the first term inside the brackets is exactly the Laplacian of $\log q$:
\Delta \log q = \nabla^{2} \log q = \sum\limits_{i=1}^{n} \dfrac{\partial^{2} \log q}{\partial \xi_{i}^{2}}
In practice only a finite amount of data is available, so given $T$ samples $\mathbf{x}(1), \dots, \mathbf{x}(T)$, the sample version of the objective is as follows:
\tilde{J}(\boldsymbol{\theta}) = \dfrac{1}{T} \sum\limits_{t=1}^{T} \sum\limits_{i=1}^{n} \left[ \partial_{i} \psi_{i}(\mathbf{x}(t); \boldsymbol{\theta}) + \dfrac{1}{2} \psi_{i}(\mathbf{x}(t); \boldsymbol{\theta})^{2} \right] + \text{constant}
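To make the estimator concrete, here is a minimal numerical sketch (my own illustration, not code from the paper). It uses a one-dimensional unnormalized Gaussian $q(x; \mu, \lambda) = \exp(-\lambda (x - \mu)^{2} / 2)$, whose score is $\psi(x) = -\lambda (x - \mu)$ with $\partial \psi / \partial x = -\lambda$; the data, parameterization, and optimizer choice below are assumptions for the demonstration.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: score matching for a 1-D unnormalized Gaussian q(x; mu, lam) = exp(-lam*(x-mu)^2/2).
# Score: psi(x) = -lam*(x - mu);  derivative of the score w.r.t. x: -lam.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=5_000)   # stand-in "data"; true mean 2.0, std 1.5

def j_tilde(params):
    mu, log_lam = params                  # optimize log(lam) so that lam stays positive
    lam = np.exp(log_lam)
    psi = -lam * (x - mu)                 # model score evaluated at each sample
    d_psi = -lam * np.ones_like(x)        # its derivative w.r.t. x (constant here)
    return np.mean(d_psi + 0.5 * psi**2)  # sample objective, constant term dropped

res = minimize(j_tilde, x0=np.array([0.0, 0.0]))
mu_hat, lam_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, 1.0 / np.sqrt(lam_hat))     # should be close to 2.0 and 1.5
```

For this particular model the minimizer is also available in closed form (the sample mean and the inverse of the sample variance), so the numerical optimization here is only a sanity check.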
Next, the following theorem shows that minimizing $(2)$ is indeed sufficient for estimating the model.
Theorem 2. Assume there exists a unique $\boldsymbol{\theta}^{\ast}$ such that $p_{\mathbf{x}}(\cdot) = p(\cdot; \boldsymbol{\theta}^{\ast})$. Also assume $q(\boldsymbol{\xi}; \boldsymbol{\theta}) > 0$. Then the following holds:
J(\boldsymbol{\theta}) = 0 \iff \boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}
Corollary. Under the assumptions of the preceding theorems, the score matching estimator obtained by minimizing $\tilde{J}$ is a consistent estimator: as the sample size grows without bound, the estimator converges in probability to the true value $\boldsymbol{\theta}^{\ast}$, assuming the optimization algorithm is able to find the global minimum.

As the sample size increases, $\tilde{J}$ converges to $J$ by the law of large numbers; therefore, the Corollary holds.
3. Examples

3.1 Multivariate Gaussian Density

Consider a very simple case, the multivariate normal distribution:
p(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = \dfrac{1}{Z(\mathbf{M}, \boldsymbol{\mu} )} \exp \left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) \right)
Here, $\mathbf{M} \in \mathbb{R}^{n \times n}$ is the inverse of the covariance matrix (the precision matrix), which is symmetric and positive definite, and $\boldsymbol{\mu} \in \mathbb{R}^{n}$ is the mean vector. Although the normalization constant is well known in this case, $Z(\mathbf{M}, \boldsymbol{\mu}) = \left( (2\pi)^{n} / \det \mathbf{M} \right)^{1/2}$, let us use it as a simple example.
3.1.1 Estimation

In this case, $q$, $\psi$, and $\partial_{i} \psi_{i}$ are expressed as follows:
q(\mathbf{x}) = \exp \left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \mathbf{M} (\mathbf{x} - \boldsymbol{\mu}) \right)
For a symmetric matrix $\mathbf{R}$, the gradient of the quadratic form is $\nabla_{\mathbf{x}} (\mathbf{x}^{\mathsf{T}} \mathbf{R} \mathbf{x}) = 2 \mathbf{R} \mathbf{x}$, thus:
\psi(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = -\mathbf{M} (\mathbf{x} - \boldsymbol{\mu})
= - \begin{bmatrix} \sum\limits_{j}m_{1j}(x_{j}-\mu_{j}) \\ \vdots \\[1em] \sum\limits_{j}m_{nj}(x_{j}-\mu_{j})\end{bmatrix}
Here, $\mathbf{M} = [m_{ij}]$. Then $\partial_{i} \psi_{i} = \dfrac{\partial \psi_{i}}{\partial x_{i}}$ is as follows:
\partial_{i} \psi_{i}(\mathbf{x}; \mathbf{M}, \boldsymbol{\mu}) = -m_{ii}
Therefore, using $\sum_{i} \psi_{i}^{2} = \braket{\psi, \psi} = \psi^{\mathsf{T}} \psi$, $\tilde{J}$ is as follows:
\begin{align*}
\tilde{J}(\mathbf{M}, \boldsymbol{\mu})
&= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ \sum\limits_{i} -m_{ii} + \dfrac{1}{2}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \mathbf{x}(t) - \boldsymbol{\mu} \right) \right] \\
&= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ - \Tr (\mathbf{M}) + \dfrac{1}{2}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}} \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \mathbf{x}(t) - \boldsymbol{\mu} \right) \right]
\end{align*}
Here, $\Tr$ denotes the trace. To find the $\boldsymbol{\mu}$ that minimizes the above expression, we compute the gradient; by the gradient formula for quadratic forms:
\begin{align*}
\nabla_{\boldsymbol{\mu}} \tilde{J}
&= \dfrac{1}{T} \sum\limits_{t=1}^{T} \left[ \mathbf{M}^{\mathsf{T}} \mathbf{M} \left( \boldsymbol{\mu} - \mathbf{x}(t) \right) \right] \\
&= \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{M}^{\mathsf{T}} \mathbf{M} \boldsymbol{\mu} - \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{M}^{\mathsf{T}} \mathbf{M} \mathbf{x}(t) \\
&= \mathbf{M}^{\mathsf{T}} \mathbf{M} \boldsymbol{\mu} - \mathbf{M}^{\mathsf{T}} \mathbf{M} \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{x}(t)
\end{align*}
So the $\boldsymbol{\mu}$ that satisfies $\nabla_{\boldsymbol{\mu}} \tilde{J} = \mathbf{0}$ is the sample mean (since $\mathbf{M}$ is positive definite, $\mathbf{M}^{\mathsf{T}}\mathbf{M}$ is invertible):
\boldsymbol{\mu}^{\ast} = \dfrac{1}{T} \sum\limits_{t=1}^{T} \mathbf{x}(t)
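As a quick numerical check (again my own sketch, with invented data and names), $\tilde{J}(\mathbf{M}, \boldsymbol{\mu})$ can be transcribed directly into code, and moving $\boldsymbol{\mu}$ away from the sample mean indeed increases the objective:

```python
import numpy as np

def j_tilde_gaussian(X, M, mu):
    """Sample objective above: mean over t of -Tr(M) + 0.5 (x(t)-mu)^T M^T M (x(t)-mu)."""
    centered = X - mu                     # rows are x(t) - mu
    Mx = centered @ M.T                   # rows are M (x(t) - mu)
    return np.mean(-np.trace(M) + 0.5 * np.sum(Mx * Mx, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(5_000, 2))  # made-up data, shape (T, n)
M = np.eye(2)                                                # any fixed positive definite M

mu_star = X.mean(axis=0)
print(j_tilde_gaussian(X, M, mu_star))        # minimal value over mu for this M
print(j_tilde_gaussian(X, M, mu_star + 0.5))  # larger: mu has moved off the sample mean
```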
Matrix Derivative of a Scalar Function:
\nabla_{\mathbf{X}} (\Tr \mathbf{X}) = I

\nabla_{\mathbf{X}} (\mathbf{a}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{a}) = 2\mathbf{X}\mathbf{a}\mathbf{a}^{\mathsf{T}}
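These identities are easy to verify numerically; the following finite-difference check is my own addition, not part of the paper:

```python
import numpy as np

# Finite-difference check of grad_X (a^T X^T X a) = 2 X a a^T on a random 3x3 example.
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 3))
a = rng.normal(size=3)

f = lambda X: a @ X.T @ X @ a       # scalar function of the matrix X
analytic = 2 * X @ np.outer(a, a)   # claimed gradient formula

numeric = np.zeros_like(X)
eps = 1e-6
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)  # central difference

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```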
Applying these formulas, $\nabla_{\mathbf{M}} \tilde{J}$ becomes the following (this matches the result in the paper, written in slightly simplified form):
\begin{align*}
\nabla_{\mathbf{M}} \tilde{J}
&= \dfrac{1}{T} \sum\limits_{t=1}^{T} [-I + \mathbf{M}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}] \\
&= -I + \mathbf{M} \dfrac{1}{T} \sum\limits_{t=1}^{T}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}
\end{align*}
For the above expression to equal $\mathbf{0}$, $\mathbf{M}$ must be the inverse of the sample covariance matrix $\dfrac{1}{T}\sum\limits_{t=1}^{T}\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)\left( \mathbf{x}(t) - \boldsymbol{\mu} \right)^{\mathsf{T}}$.
As is evident from this result, for the multivariate Gaussian, Score Matching yields an estimator equivalent to maximum likelihood estimation.
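These closed-form estimates are easy to check numerically. Below is a minimal sketch (the true parameters, sample size, and variable names are invented for the demonstration):

```python
import numpy as np

# Score matching estimates for the Gaussian example: sample mean and inverse sample covariance.
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=20_000)   # shape (T, n)

mu_hat = X.mean(axis=0)                          # minimizer of J~ w.r.t. mu: the sample mean
C = (X - mu_hat).T @ (X - mu_hat) / X.shape[0]   # sample covariance (1/T) sum (x-mu)(x-mu)^T
M_hat = np.linalg.inv(C)                         # score matching estimate of the precision matrix

print(np.round(mu_hat, 2))                 # close to true_mu
print(np.round(np.linalg.inv(M_hat), 2))   # close to true_cov; same as the Gaussian MLE
```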
Appendix A. Proof of Theorem 1

Expanding the squared norm $\| \cdot \|^{2} = \Braket{\cdot, \cdot}$ in $(1)$ gives the following:
\begin{align*}
J(\boldsymbol{\theta})
&= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \left\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \right\|^{2} \mathrm{d}\boldsymbol{\xi} \\
&= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi})} \mathrm{d}\boldsymbol{\xi} \\
&= \dfrac{1}{2} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \Braket{ \psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) } + \Braket{\psi_{\mathbf{x}}(\boldsymbol{\xi}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} - 2\Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \\
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \dfrac{1}{2}\| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} - \Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi}
\end{align*}
Examining the integral of only the third term, we have:
-\sum\limits_{i} \int p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{\mathbf{x}, i}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi}
For each $i$, the integral can be rewritten as follows by applying the derivative of the logarithm:
\begin{align*}
-\int p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{\mathbf{x}, i}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi}
&= -\int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \log p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\
&= -\int p_{\mathbf{x}}(\boldsymbol{\xi})\left( \dfrac{1}{p_{\mathbf{x}}(\boldsymbol{\xi})} \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \right)\psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\
&= -\int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi}
\end{align*}
This can be further rewritten using integration by parts, applied to the inner integral over $\xi_{i}$:
\begin{align*}
& -\int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi} \\
&= -\int \left( \int \dfrac{\partial p_{\mathbf{x}}(\boldsymbol{\xi})}{\partial \xi_{i}} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\xi_{i} \right) \mathrm{d}(\xi_{1},\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_{n}) \\
&= -\int \left( \left[p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \right]_{\xi_{i}=-\infty}^{\infty} - \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\xi_{i} \right) \mathrm{d}(\xi_{1},\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_{n}) \\
&= \int \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\xi_{i}\, \mathrm{d}(\xi_{1},\dots,\xi_{i-1},\xi_{i+1},\dots,\xi_{n}) \\
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \dfrac{\partial \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})}{\partial \xi_{i}} \mathrm{d}\boldsymbol{\xi} = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \mathrm{d}\boldsymbol{\xi}
\end{align*}
Here, the bracketed boundary term vanishes: $p_{\mathbf{x}}$ is a probability density, and the regularity condition assumed in the paper is that $p_{\mathbf{x}}(\boldsymbol{\xi}) \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \to 0$ as $|\xi_{i}| \to \infty$ for any $\boldsymbol{\theta}$, so the definite integral in brackets is $0$. Substituting back and rearranging, we get:
\begin{align*}
J(\boldsymbol{\theta})
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \dfrac{1}{2}\| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} - \Braket{\psi(\boldsymbol{\xi}; \boldsymbol{\theta}), \psi_{\mathbf{x}}(\boldsymbol{\xi})} \Big] \mathrm{d}\boldsymbol{\xi} \\
&= \int p_{\mathbf{x}}(\boldsymbol{\xi}) \Big[ \dfrac{1}{2}\| \psi(\boldsymbol{\xi}; \boldsymbol{\theta})\|^{2} + \sum\limits_{i}\partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \Big] \mathrm{d}\boldsymbol{\xi} + \dfrac{1}{2}\int p_{\mathbf{x}}(\boldsymbol{\xi}) \| \psi_{\mathbf{x}}(\boldsymbol{\xi})\|^{2} \mathrm{d}\boldsymbol{\xi}
\end{align*}
The last term is a constant that does not depend on $\boldsymbol{\theta}$. Hence, we obtain:
J(\boldsymbol{\theta}) = \int p_{\mathbf{x}}(\boldsymbol{\xi}) \sum\limits_{i}\Big[ \frac{1}{2} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta})^{2} + \partial_{i} \psi_{i}(\boldsymbol{\xi}; \boldsymbol{\theta}) \Big] \mathrm{d}\boldsymbol{\xi} + \text{constant}
■
Appendix B. Proof of Theorem 2

We want to show:

J(\boldsymbol{\theta}) = 0 \iff \boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}
($\implies$)
Assume $J(\boldsymbol{\theta}) = 0$.
J(\boldsymbol{\theta}) = \dfrac{1}{2}\int p_{\mathbf{x}}(\boldsymbol{\xi}) \| \psi(\boldsymbol{\xi}; \boldsymbol{\theta}) - \psi_{\mathbf{x}}(\boldsymbol{\xi}) \|^{2} \mathrm{d}\boldsymbol{\xi} = \dfrac{1}{2}\left\langle p_{\mathbf{x}}, \| \psi - \psi_{\mathbf{x}} \|^{2} \right\rangle = 0
Since $q > 0$, we have $p_{\mathbf{x}}(\boldsymbol{\xi}) > 0$ for all $\boldsymbol{\xi}$, and since $\| \psi - \psi_{\mathbf{x}} \|^{2} \ge 0$, the integral can only vanish if $\| \psi - \psi_{\mathbf{x}} \|^{2} = 0$ for all $\boldsymbol{\xi}$.
J = 0 \implies \left\langle p_{\mathbf{x}}, \| \psi - \psi_{\mathbf{x}} \|^{2} \right\rangle = 0 \implies \| \psi - \psi_{\mathbf{x}} \|^{2} = 0 \implies \psi = \psi_{\mathbf{x}}
This implies the following:
\psi_{\mathbf{x}} = \psi \implies \log p_{\mathbf{x}}(\cdot) = \log p( \cdot; \boldsymbol{\theta}) + \text{constant}
Since both $p_{\mathbf{x}}$ and $p(\cdot; \boldsymbol{\theta})$ are probability density functions that integrate to $1$, the constant must be $0$. Hence $p_{\mathbf{x}}(\cdot) = p(\cdot; \boldsymbol{\theta})$, and by the uniqueness assumption, $\boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}$.
■
($\impliedby$)
If $\boldsymbol{\theta} = \boldsymbol{\theta}^{\ast}$, then $\psi = \psi_{\mathbf{x}}$ everywhere, so the integrand in $(1)$ vanishes and $J(\boldsymbol{\theta}^{\ast}) = 0$.
■