Let’s say that the random vector $(N_1, \cdots, N_k)$ follows the multinomial distribution $M_k(n; \mathbf{p})$ for $\mathbf{p} = (p_1, \cdots, p_k) \in [0,1]^k$ satisfying $\sum_{i=1}^{k} N_i = n$ and $\sum_{i=1}^{k} p_i = 1$, with sample size $n \in \mathbb{N}$ and $k \in \mathbb{N}$ categories. Then, as $n \to \infty$, the statistic $S$ converges in distribution to the chi-squared distribution $\chi^2(k-1)$.
$$
S := \sum_{j=1}^{k} \frac{(N_j - np_j)^2}{np_j} \overset{D}{\to} \chi^2(k-1)
$$
Here $\chi^2(r)$ refers to the chi-squared distribution with $r$ degrees of freedom.
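As a quick sanity check of the statement, here is a small simulation sketch of my own (the choices of $\mathbf{p}$, $n$, and the number of repetitions are arbitrary): it draws multinomial counts with numpy and compares a few empirical quantiles of $S$ with the corresponding $\chi^2(k-1)$ quantiles from scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.1, 0.4])   # k = 4 categories, chosen arbitrarily
k, n, reps = len(p), 5_000, 20_000

counts = rng.multinomial(n, p, size=reps)          # reps draws of (N_1, ..., N_k)
S = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)  # Pearson statistic for each draw

# Compare a few empirical quantiles of S with the chi-squared(k-1) quantiles.
for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(S, q), stats.chi2.ppf(q, df=k - 1))
```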
Description
Strictly speaking, the term Pearson’s theorem is not used very often. Honestly, I’ve only seen it once1, and the result is usually referred to simply as the Pearson Chi-squared Statistic $X^2 := \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$ converging to the chi-squared distribution. This is because the discussion usually moves straight on to the practicalities of hypothesis testing rather than treating the result as a “theorem” in its own right, so the rigorous mathematical proof tends to be neglected.
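For a concrete feel of how the statistic is used in practice, here is a small illustration of my own; the observed counts and the null probabilities below are made up, and scipy.stats.chisquare is used only as a cross-check of the hand-computed $X^2$.

```python
import numpy as np
from scipy import stats

observed = np.array([18, 30, 12, 40])     # hypothetical observed counts O_j
p0 = np.array([0.2, 0.3, 0.1, 0.4])       # null-hypothesis cell probabilities
expected = observed.sum() * p0            # expected counts E_j = n p_j

X2 = ((observed - expected) ** 2 / expected).sum()
stat, pvalue = stats.chisquare(f_obs=observed, f_exp=expected)
print(X2, stat, pvalue)                   # X2 and stat coincide
```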
Let’s say that every component of $\mathbf{p}$ satisfies $p_j > 0$, and write
$$
S := \sum_{j=1}^{k} \frac{(N_j - np_j)^2}{np_j}
$$
Given the constraint $\sum_{j=1}^{k} (N_j - np_j) = 0$, separating out the last, $k$th term yields
$$
\begin{aligned}
S &= \sum_{j=1}^{k} \frac{(N_j - np_j)^2}{np_j} \\
&= \sum_{j=1}^{k-1} \frac{(N_j - np_j)^2}{np_j} + \frac{(N_k - np_k)^2}{np_k} \\
&= \sum_{j=1}^{k-1} \frac{(N_j - np_j)^2}{np_j} + \frac{\left( \sum_{j=1}^{k-1} (N_j - np_j) \right)^2}{np_k}
\end{aligned}
$$
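This algebra is easy to check numerically. The snippet below is my own, with an arbitrary $\mathbf{p}$ and $n$; it confirms that the full sum and the split form agree for a single simulated draw.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.25, 0.25, 0.3, 0.2])
n = 1_000
N = rng.multinomial(n, p)                 # one draw of (N_1, ..., N_k)

full = ((N - n * p) ** 2 / (n * p)).sum()
d = (N - n * p)[:-1]                      # deviations of the first k-1 cells
split = (d ** 2 / (n * p[:-1])).sum() + d.sum() ** 2 / (n * p[-1])
print(np.isclose(full, split))            # True
```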
In fact, given this constraint, $N_k$ is redundant, so consider the vectors $\mathbf{N} := (N_1, \cdots, N_{k-1})$ and $\mathbf{p} := (p_1, \cdots, p_{k-1})$ without the $k$th component, and let $\Sigma$ be the covariance matrix of $\mathbf{N}$.
Covariance matrix of the multinomial distribution: If the random vector $X := (X_1, \cdots, X_k)$ follows the multinomial distribution $M_k(n, \mathbf{p})$, then its covariance matrix is as follows:
$$
\operatorname{Cov}(X) = n \begin{bmatrix}
p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\
-p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_k \\
\vdots & \vdots & \ddots & \vdots \\
-p_k p_1 & -p_k p_2 & \cdots & p_k(1-p_k)
\end{bmatrix}
$$
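This formula can also be checked empirically. The sketch below (an illustration of mine, not part of the original argument) compares the sample covariance of simulated multinomial draws with the matrix above, built here as $n(\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T)$, which has exactly the entries $np_i(\delta_{ij} - p_j)$.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.2, 0.5, 0.3])
n, reps = 50, 200_000

X = rng.multinomial(n, p, size=reps)
empirical = np.cov(X, rowvar=False)               # sample covariance of the draws
theoretical = n * (np.diag(p) - np.outer(p, p))   # entries n p_i (delta_ij - p_j)
print(np.abs(empirical - theoretical).max())      # small, and shrinks as reps grows
```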
Whether or not the $k$th component is included, there is no reason for $\operatorname{Cov}(N_i, N_j)$ to change, so the following is obtained:
$$
\begin{aligned}
\frac{1}{n}\Sigma &= \frac{1}{n}\operatorname{Cov}(\mathbf{N}) \\
&= \begin{bmatrix}
p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\
-p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_{k-1} \\
\vdots & \vdots & \ddots & \vdots \\
-p_{k-1} p_1 & -p_{k-1} p_2 & \cdots & p_{k-1}(1-p_{k-1})
\end{bmatrix} \\
&= \begin{bmatrix}
p_1 & 0 & \cdots & 0 \\
0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{k-1}
\end{bmatrix} - \begin{bmatrix}
p_1^2 & p_1 p_2 & \cdots & p_1 p_{k-1} \\
p_2 p_1 & p_2^2 & \cdots & p_2 p_{k-1} \\
\vdots & \vdots & \ddots & \vdots \\
p_{k-1} p_1 & p_{k-1} p_2 & \cdots & p_{k-1}^2
\end{bmatrix} \\
&= \begin{bmatrix}
p_1 & 0 & \cdots & 0 \\
0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{k-1}
\end{bmatrix} - \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{k-1} \end{bmatrix} \begin{bmatrix} p_1 & p_2 & \cdots & p_{k-1} \end{bmatrix} \\
&= \operatorname{diag}(p_1, \cdots, p_{k-1}) - \mathbf{p}\mathbf{p}^T
\end{aligned}
$$
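The rank-one decomposition at the end is immediate, but here is a tiny numerical confirmation of my own that the leading $(k-1)\times(k-1)$ block of the full covariance formula equals $\operatorname{diag}(p_1, \cdots, p_{k-1}) - \mathbf{p}\mathbf{p}^T$; the probability vector is an arbitrary example.

```python
import numpy as np

p_full = np.array([0.2, 0.5, 0.3])                      # arbitrary example with k = 3
p = p_full[:-1]                                         # p without the k-th component

# Leading (k-1)x(k-1) block of (1/n) Cov(N_1, ..., N_k) from the formula above
block = (np.diag(p_full) - np.outer(p_full, p_full))[:-1, :-1]
decomposition = np.diag(p) - np.outer(p, p)             # diag(p) - p p^T
print(np.allclose(block, decomposition))                # True
```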
Here, $\mathbf{p}^T$ denotes the transpose of $\mathbf{p}$. Now set $P := \operatorname{diag}(p_1, \cdots, p_{k-1})$.
Sherman–Morrison formula: When $(A + \mathbf{u}\mathbf{v}^T)^{-1}$ exists, it is given explicitly as follows:
$$
(A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^T A^{-1}}{1 + \mathbf{v}^T A^{-1}\mathbf{u}}
$$
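A standalone numerical sanity check of this identity, with a deliberately simple diagonal $A$ and hand-picked vectors $\mathbf{u}$, $\mathbf{v}$ (all of my choosing), looks like this:

```python
import numpy as np

A = np.diag([1.0, 2.0, 3.0, 4.0])                       # simple invertible A
u = np.array([[1.0], [0.5], [-0.2], [0.3]])             # column vectors u, v
v = np.array([[0.1], [0.2], [0.3], [0.4]])

lhs = np.linalg.inv(A + u @ v.T)
Ainv = np.linalg.inv(A)
rhs = Ainv - (Ainv @ u @ v.T @ Ainv) / (1 + (v.T @ Ainv @ u).item())
print(np.allclose(lhs, rhs))                            # True
```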
To compute the inverse of $\frac{1}{n}\Sigma$, note that the inverse of the diagonal matrix
$$
P = \operatorname{diag}(p_1, \cdots, p_{k-1})
$$
is the diagonal matrix $P^{-1} = \operatorname{diag}(p_1^{-1}, \cdots, p_{k-1}^{-1})$ of reciprocal diagonal entries. Writing $\mathbf{1}_{k-1}$ for the $(k-1)$-dimensional vector of ones, so that $P^{-1}\mathbf{p} = \mathbf{1}_{k-1}$ and $\mathbf{p}^T P^{-1}\mathbf{p} = p_1 + \cdots + p_{k-1}$, the Sherman–Morrison formula gives
$$
\begin{aligned}
\left( \frac{1}{n}\Sigma \right)^{-1} &= \left( P + (-\mathbf{p})\mathbf{p}^T \right)^{-1} \\
&= P^{-1} + \frac{P^{-1}\mathbf{p}\mathbf{p}^T P^{-1}}{1 - \mathbf{p}^T P^{-1}\mathbf{p}} \\
&= P^{-1} + \frac{\mathbf{1}_{k-1}\mathbf{1}_{k-1}^T}{1 - p_1 - \cdots - p_{k-1}} \\
&= P^{-1} + \frac{1}{p_k}\mathbf{1}_{k-1}\mathbf{1}_{k-1}^T \\
&= \begin{bmatrix}
\frac{1}{p_1}+\frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\
\frac{1}{p_k} & \frac{1}{p_2}+\frac{1}{p_k} & \cdots & \frac{1}{p_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_{k-1}}+\frac{1}{p_k}
\end{bmatrix}
\end{aligned}
$$
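The resulting inverse is easy to verify directly. The snippet below (illustrative values only, my own addition) inverts $P - \mathbf{p}\mathbf{p}^T$ numerically and compares it with $P^{-1} + \frac{1}{p_k}\mathbf{1}_{k-1}\mathbf{1}_{k-1}^T$.

```python
import numpy as np

p_full = np.array([0.1, 0.2, 0.3, 0.4])                 # k = 4, so p_k = 0.4
p, pk = p_full[:-1], p_full[-1]

P = np.diag(p)
direct = np.linalg.inv(P - np.outer(p, p))              # invert P - p p^T directly
formula = np.linalg.inv(P) + np.ones((len(p), len(p))) / pk
print(np.allclose(direct, formula))                     # True
```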
Returning to S and expressing it in matrix form,
$$
\begin{aligned}
S &= \sum_{j=1}^{k-1} \frac{(N_j - np_j)^2}{np_j} + \frac{\left( \sum_{j=1}^{k-1} (N_j - np_j) \right)^2}{np_k} \\
&= \frac{1}{n}\left[ \sum_{j=1}^{k-1} (N_j - np_j)\frac{1}{p_j}(N_j - np_j) + \sum_{j=1}^{k-1} (N_j - np_j)\frac{1}{p_k}\sum_{j=1}^{k-1} (N_j - np_j) \right] \\
&= \frac{1}{n}(\mathbf{N} - n\mathbf{p})^T \left( \frac{1}{n}\Sigma \right)^{-1} (\mathbf{N} - n\mathbf{p}) \\
&= (\mathbf{N} - n\mathbf{p})^T \Sigma^{-1} (\mathbf{N} - n\mathbf{p})
\end{aligned}
$$
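As a cross-check of this matrix form (again my own addition, with arbitrary parameters), the following compares the quadratic form in the reduced vector with the original $k$-term sum for a single simulated draw.

```python
import numpy as np

rng = np.random.default_rng(4)
p_full = np.array([0.1, 0.2, 0.3, 0.4])
n = 2_000
N_full = rng.multinomial(n, p_full)

S_sum = ((N_full - n * p_full) ** 2 / (n * p_full)).sum()

p, N = p_full[:-1], N_full[:-1]                         # drop the k-th component
Sigma = n * (np.diag(p) - np.outer(p, p))               # covariance of the reduced N
d = N - n * p
S_quad = d @ np.linalg.inv(Sigma) @ d
print(np.isclose(S_sum, S_quad))                        # True
```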
De Moivre–Laplace theorem: If $X_i \sim B(1, p)$ and $Y_n = X_1 + X_2 + \cdots + X_n$, then $Y_n \sim B(n, p)$, and
$$
\frac{Y_n - np}{\sqrt{np(1-p)}} \overset{D}{\to} N(0,1)
$$
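A quick simulation of this statement (with an arbitrary $n$ and $p$ of my choosing) compares the empirical distribution of the standardized binomial with the standard normal CDF at a few points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, prob, reps = 10_000, 0.3, 100_000

Y = rng.binomial(n, prob, size=reps)
Z = (Y - n * prob) / np.sqrt(n * prob * (1 - prob))     # standardized binomial

for t in (-1.0, 0.0, 1.5):
    print(t, (Z <= t).mean(), stats.norm.cdf(t))        # empirical CDF vs N(0,1) CDF
```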
Since we already know that Σ is the covariance matrix of N,
$$
(\mathbf{N} - n\mathbf{p}) \sim (\mathbf{0}, \Sigma)
$$
and, according to the De Moivre–Laplace theorem extended to the multivariate case, this can be expressed in terms of the multivariate normal distribution $N_{k-1}$ as follows:
$$
\Sigma^{-1/2}(\mathbf{N} - n\mathbf{p}) \overset{D}{\to} N_{k-1}(\mathbf{0}, I_{k-1})
$$
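The full convergence takes more to verify, but the first two moments are easy to check by simulation. The sketch below (my own, with arbitrary parameters) confirms that $\Sigma^{-1/2}(\mathbf{N} - n\mathbf{p})$ has roughly zero mean and identity covariance for large $n$.

```python
import numpy as np

rng = np.random.default_rng(6)
p_full = np.array([0.1, 0.2, 0.3, 0.4])
p = p_full[:-1]
n, reps = 5_000, 50_000

counts = rng.multinomial(n, p_full, size=reps)[:, :-1]  # reduced vectors N
Sigma = n * (np.diag(p) - np.outer(p, p))

# Symmetric inverse square root of Sigma via its eigendecomposition
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(1 / np.sqrt(vals)) @ vecs.T
W = (counts - n * p) @ Sigma_inv_sqrt                   # rows are Sigma^{-1/2}(N - np)

print(np.round(W.mean(axis=0), 3))                      # approximately 0
print(np.round(np.cov(W, rowvar=False), 3))             # approximately I_{k-1}
```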
At this point, denote by $Z_j \sim N(0,1)$ the components, each of which converges in distribution to the standard normal distribution, and let $\phi$ be the characteristic function of $S$. Since the covariance matrix of $(Z_1, \cdots, Z_{k-1})$ is $I_{k-1}$, $Z_i$ and $Z_j$ are of course independent when $i \neq j$.
Lévy’s continuity theorem: Given a measurable space $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, let $\mu_n$ denote the probability measures for $n \in \mathbb{N} \cup \{\infty\}$, and let $\varphi_n$ denote the corresponding characteristic functions. The following are equivalent:
(a): μn weakly converges to μ∞.
(b): For all t∈Rd,
$$
\lim_{n \to \infty} \varphi_n(t) = \varphi_\infty(t)
$$
As $n \to \infty$, by Lévy’s continuity theorem, the characteristic function $\phi$ satisfies the following for $Z \sim N(0,1)$:
$$
\begin{aligned}
\phi(t) &= E\left[ e^{itS} \right] \\
&= E\left[ \exp\left( it(\mathbf{N} - n\mathbf{p})^T \Sigma^{-1} (\mathbf{N} - n\mathbf{p}) \right) \right] \\
&\overset{D}{\to} E\left[ \exp\left( it\left( Z_1^2 + \cdots + Z_{k-1}^2 \right) \right) \right] \\
&= \Big[ E\left[ \exp\left( itZ^2 \right) \right] \Big]^{k-1}
\end{aligned}
$$
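To see numerically where this is heading (my own check, not part of the article’s derivation), one can estimate $E[e^{itS}]$ by Monte Carlo for large $n$ and compare it with $(1 - 2it)^{-(k-1)/2}$, which is the characteristic function of $\chi^2(k-1)$; the parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.2, 0.3, 0.1, 0.4])
k, n, reps = len(p), 5_000, 100_000

counts = rng.multinomial(n, p, size=reps)
S = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)

for t in (0.1, 0.25):
    phi_hat = np.exp(1j * t * S).mean()                 # Monte Carlo estimate of E[exp(itS)]
    phi_chi2 = (1 - 2j * t) ** (-(k - 1) / 2)           # chi-squared(k-1) characteristic function
    print(t, phi_hat, phi_chi2)                         # close for large n
```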