
Proof of Pearson's Theorem

Theorem

Let the random vector $\left( N_{1} , \cdots , N_{k} \right)$ follow the multinomial distribution $M_{k} \left( n ; \mathbf{p} \right)$ with sample size $n \in \mathbb{N}$, $k \in \mathbb{N}$ categories, and $\mathbf{p} = \left( p_{1} , \cdots , p_{k} \right) \in [0,1]^{k}$ satisfying $$ \sum_{i=1}^{k} N_{i} = n \qquad \& \qquad \sum_{i=1}^{k} p_{i} = 1 $$ Then, as $n \to \infty$, the statistic $S$ converges in distribution to the chi-squared distribution $\chi^{2} \left( k - 1 \right)$: $$ S := \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} \overset{D}{\to} \chi^{2} \left( k-1 \right) $$


  • $[0,1]^{k} = [0,1] \times \cdots \times [0,1]$ is a $k$-cell.
  • $\overset{D}{\to}$ means convergence in distribution.
  • $\chi^{2} \left( r \right)$ refers to the chi-squared distribution with degrees of freedom $r$.
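
As a quick sanity check of the theorem, here is a minimal simulation sketch in Python, assuming numpy and scipy are available; the sample size, probabilities, and number of replications are arbitrary choices for illustration. It draws many multinomial samples, computes $S$ for each, and compares its empirical quantiles with those of $\chi^{2}(k-1)$.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)

n = 1000                                   # sample size (arbitrary)
p = np.array([0.2, 0.3, 0.1, 0.4])         # k = 4 category probabilities (arbitrary)
k = len(p)

# Draw many multinomial samples and compute Pearson's statistic S for each
N = rng.multinomial(n, p, size=100_000)    # shape (100000, k)
S = ((N - n * p) ** 2 / (n * p)).sum(axis=1)

# Empirical quantiles of S should be close to chi-squared(k - 1) quantiles
for q in [0.5, 0.9, 0.95, 0.99]:
    print(q, np.quantile(S, q), chi2.ppf(q, df=k - 1))
```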

Description

Strictly speaking, the term Pearson’s theorem is not often used. Honestly, I’ve only seen it once1, and the result is usually stated simply as the Pearson Chi-squared Statistic $$ \mathcal{X}^{2} := \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} $$ converging to the chi-squared distribution, where $O_{j} = N_{j}$ are the observed counts and $E_{j} = n p_{j}$ are the expected counts. This is because the discussion usually moves straight on to the practicalities of hypothesis testing rather than treating the result as a “theorem” in its own right, so the rigorous mathematical proof tends to be neglected.
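
In practice the statistic is computed directly from observed counts $O_{j}$ and expected counts $E_{j} = n p_{j}$. Here is a minimal sketch using `scipy.stats.chisquare`; the die-roll counts below are made-up numbers purely for illustration.

```python
from scipy.stats import chisquare

# Hypothetical counts: a die rolled 120 times (illustrative numbers only)
observed = [18, 22, 25, 17, 19, 19]        # O_j
expected = [20] * 6                        # E_j = n * p_j under a fair-die hypothesis

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)                       # statistic is compared against chi-squared(5)
```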

Proof 2

Assume all components of $\mathbf{p}$ satisfy $p_{j} > 0$, and write $$ S := \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} $$ The constraint $\sum_{j=1}^{k} \left( N_{j} - n p_{j} \right) = 0$ gives $N_{k} - n p_{k} = - \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right)$, so separating out the last ($k$th) term yields $$ \begin{align*} S =& \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} \\ =& \sum_{j=1}^{k-1} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} + {{ \left( N_{k} - n p_{k} \right)^{2} } \over { n p_{k} }} \\ =& \sum_{j=1}^{k-1} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} + {{ \left( \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) \right)^{2} } \over { n p_{k} }} \end{align*} $$ In fact, given the constraint, $N_{k}$ is redundant, so for $\widetilde{N} := \left( N_{1} , \cdots , N_{k-1} \right)$ and $\widetilde{\mathbf{p}} := \left( p_{1} , \cdots , p_{k-1} \right)$ without the $k$th component, consider the covariance matrix $\widetilde{\Sigma}$ of $\widetilde{N}$.
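
The rewriting of the $k$th term through the constraint is easy to verify numerically. A minimal sketch, with an arbitrary choice of $n$ and $\mathbf{p}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                    # arbitrary sample size
p = np.array([0.1, 0.2, 0.3, 0.4])         # arbitrary probabilities, k = 4
N = rng.multinomial(n, p)                  # one multinomial draw, sums to n

# Full sum over all k categories
S_full = ((N - n * p) ** 2 / (n * p)).sum()

# First k-1 terms plus the kth term rewritten via the constraint
d = N[:-1] - n * p[:-1]
S_split = (d ** 2 / (n * p[:-1])).sum() + d.sum() ** 2 / (n * p[-1])

print(np.isclose(S_full, S_split))         # True
```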

Covariance matrix of the multinomial distribution: If the random vector $\mathbf{X} := \left( X_{1} , \cdots , X_{k} \right)$ follows the multinomial distribution $M_{k} \left( n, \mathbf{p} \right)$, then its covariance matrix is $$ \operatorname{Cov} \left( \mathbf{X} \right) = n \begin{bmatrix} p_{1} \left( 1 - p_{1} \right) & - p_{1} p_{2} & \cdots & - p_{1} p_{k} \\ - p_{2} p_{1} & p_{2} \left( 1 - p_{2} \right) & \cdots & - p_{2} p_{k} \\ \vdots & \vdots & \ddots & \vdots \\ - p_{k} p_{1} & - p_{k} p_{2} & \cdots & p_{k} \left( 1 - p_{k} \right) \end{bmatrix} $$

Whether or not the $k$th component is included, $\operatorname{Cov} \left( N_{i} , N_{j} \right)$ does not change, so the following is obtained: $$ \begin{align*} & {{ 1 } \over { n }} \widetilde{\Sigma} \\ =& {{ 1 } \over { n }} \operatorname{Cov} \left( \widetilde{N} \right) \\ =& \begin{bmatrix} p_{1} \left( 1 - p_{1} \right) & - p_{1} p_{2} & \cdots & - p_{1} p_{k-1} \\ - p_{2} p_{1} & p_{2} \left( 1 - p_{2} \right) & \cdots & - p_{2} p_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ - p_{k-1} p_{1} & - p_{k-1} p_{2} & \cdots & p_{k-1} \left( 1 - p_{k-1} \right) \end{bmatrix} \\ =& \begin{bmatrix} p_{1} & 0 & \cdots & 0 \\ 0 & p_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_{k-1} \end{bmatrix} - \begin{bmatrix} p_{1}^{2} & p_{1} p_{2} & \cdots & p_{1} p_{k-1} \\ p_{2} p_{1} & p_{2}^{2} & \cdots & p_{2} p_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k-1} p_{1} & p_{k-1} p_{2} & \cdots & p_{k-1}^{2} \end{bmatrix} \\ =& \begin{bmatrix} p_{1} & 0 & \cdots & 0 \\ 0 & p_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_{k-1} \end{bmatrix} - \begin{bmatrix} p_{1} \\ p_{2} \\ \vdots \\ p_{k-1} \end{bmatrix} \begin{bmatrix} p_{1} & p_{2} & \cdots & p_{k-1} \end{bmatrix} \\ =& \operatorname{diag} \left( \widetilde{\mathbf{p}} \right) - \widetilde{\mathbf{p}} \widetilde{\mathbf{p}}^{T} \end{align*} $$ Here, $\widetilde{\mathbf{p}}^{T}$ denotes the transpose of $\widetilde{\mathbf{p}}$. Now set $\widetilde{P} := \operatorname{diag} \left( \widetilde{\mathbf{p}} \right) = \operatorname{diag} \left( p_{1} , \cdots , p_{k-1} \right)$.
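
The identity ${{ 1 } \over { n }} \widetilde{\Sigma} = \operatorname{diag} \left( \widetilde{\mathbf{p}} \right) - \widetilde{\mathbf{p}} \widetilde{\mathbf{p}}^{T}$ can be checked against an empirical covariance matrix; a minimal sketch, assuming numpy and an arbitrary choice of $\mathbf{p}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                    # arbitrary sample size
p = np.array([0.5, 0.3, 0.2])              # arbitrary probabilities, k = 3
p_tilde = p[:-1]                           # drop the kth component

# Theoretical (1/n) * Cov of the reduced vector: diag(p~) - p~ p~^T
theory = np.diag(p_tilde) - np.outer(p_tilde, p_tilde)

# Empirical covariance of the first k-1 counts, divided by n
samples = rng.multinomial(n, p, size=200_000)[:, :-1]
empirical = np.cov(samples, rowvar=False) / n

print(np.round(theory, 4))
print(np.round(empirical, 4))              # close to the theoretical matrix
```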

Sherman–Morrison formula: If $A$ is invertible and $1 + \mathbf{v}^{T} A^{-1} \mathbf{u} \ne 0$, then $A + \mathbf{u} \mathbf{v}^{T}$ is invertible and $$ \left( A + \mathbf{u} \mathbf{v}^{T} \right)^{-1} = A^{-1} - {{ A^{-1} \mathbf{u} \mathbf{v}^{T} A^{-1} } \over { 1 + \mathbf{v}^{T} A^{-1} \mathbf{u} }} $$

To compute the inverse of ${{ 1 } \over { n }} \widetilde{\Sigma}$, note that the inverse of the diagonal matrix $$ \widetilde{P} = \operatorname{diag} \left( p_{1} , \cdots , p_{k-1} \right) $$ is the diagonal matrix $\widetilde{P}^{-1} = \operatorname{diag} \left( p_{1}^{-1} , \cdots , p_{k-1}^{-1} \right)$ of reciprocal diagonal entries. Writing $\mathbf{1} := \left( 1 , \cdots , 1 \right)^{T} \in \mathbb{R}^{k-1}$ for the all-ones vector, we have $\widetilde{P}^{-1} \widetilde{\mathbf{p}} = \mathbf{1}$ and $\widetilde{\mathbf{p}}^{T} \widetilde{P}^{-1} \widetilde{\mathbf{p}} = p_{1} + \cdots + p_{k-1} = 1 - p_{k}$, so by the Sherman–Morrison formula, $$ \begin{align*} & \left( {{ 1 } \over { n }} \widetilde{\Sigma} \right)^{-1} \\ =& \left( \widetilde{P} + \left( - \widetilde{\mathbf{p}} \right) \widetilde{\mathbf{p}}^{T} \right)^{-1} \\ =& \widetilde{P}^{-1} + {{ \widetilde{P}^{-1} \widetilde{\mathbf{p}} \widetilde{\mathbf{p}}^{T} \widetilde{P}^{-1} } \over { 1 - \widetilde{\mathbf{p}}^{T} \widetilde{P}^{-1} \widetilde{\mathbf{p}} }} \\ =& \widetilde{P}^{-1} + {{ \mathbf{1} \mathbf{1}^{T} } \over { 1 - p_{1} - \cdots - p_{k-1} }} \\ =& \widetilde{P}^{-1} + {{ 1 } \over { p_{k} }} \mathbf{1} \mathbf{1}^{T} \\ =& \begin{bmatrix} {{ 1 } \over { p_{1} }} + {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k} }} \\ {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{2} }} + {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k} }} \\ {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{3} }} + {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k} }} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k-1} }} + {{ 1 } \over { p_{k} }} \end{bmatrix} \end{align*} $$ Returning to $S$ and writing it in matrix form, $$ \begin{align*} S =& \sum_{j=1}^{k-1} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} + {{ \left( \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) \right)^{2} } \over { n p_{k} }} \\ =& {{ 1 } \over { n }} \left[ \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) {{ 1 } \over { p_{j} }} \left( N_{j} - n p_{j} \right) + \sum_{i=1}^{k-1} \sum_{j=1}^{k-1} \left( N_{i} - n p_{i} \right) {{ 1 } \over { p_{k} }} \left( N_{j} - n p_{j} \right) \right] \\ =& {{ 1 } \over { n }} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)^{T} \left( {{ 1 } \over { n }} \widetilde{\Sigma} \right)^{-1} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \\ =& \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)^{T} \widetilde{\Sigma}^{-1} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \end{align*} $$
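
Both the closed-form inverse and the matrix expression for $S$ can be verified numerically. A minimal sketch, again with arbitrary $n$ and $\mathbf{p}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300                                    # arbitrary sample size
p = np.array([0.25, 0.25, 0.2, 0.3])       # arbitrary probabilities, k = 4
p_t, p_k = p[:-1], p[-1]
m = len(p_t)                               # k - 1

# Sherman-Morrison inverse of diag(p~) - p~ p~^T: diag(1/p~) + (1/p_k) * (ones matrix)
A = np.diag(p_t) - np.outer(p_t, p_t)
A_inv = np.diag(1 / p_t) + np.ones((m, m)) / p_k
print(np.allclose(A @ A_inv, np.eye(m)))   # True

# Quadratic form (N~ - n p~)^T Sigma~^{-1} (N~ - n p~) coincides with Pearson's S
N = rng.multinomial(n, p)
d = N[:-1] - n * p_t
S_quad = d @ (A_inv / n) @ d               # Sigma~^{-1} = (1/n) * A_inv
S_sum = ((N - n * p) ** 2 / (n * p)).sum()
print(np.isclose(S_quad, S_sum))           # True
```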

De Moivre–Laplace theorem: If $X_i \sim B(1,p)$ and $Y_n = X_1 + X_2 + \cdots + X_n$, then $Y_n \sim B(n,p)$, and $$ { { Y_n - np } \over {\sqrt{ np(1-p) } } }\overset{D}{\to} N(0,1) $$

Since $\widetilde{N}$ is the sum of $n$ i.i.d. indicator vectors with mean $\widetilde{\mathbf{p}}$ and covariance ${{ 1 } \over { n }} \widetilde{\Sigma}$, the multivariate central limit theorem (the multivariate generalization of the De Moivre–Laplace theorem) gives, for the multivariate normal distribution $\mathcal{N}_{k-1}$, $$ \widetilde{\Sigma}^{ - 1/2} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \overset{D}{\to} \mathcal{N}_{k-1} \left( \mathbf{0} , I_{k-1} \right) $$ Denote the components of the limiting vector, each following the standard normal distribution, by $Z_{j} \sim N \left( 0, 1 \right)$, and let $\phi$ be the characteristic function of $S$. Since $\left( Z_{1} , \cdots , Z_{k-1} \right)$ is jointly normal with covariance matrix $I_{k-1}$, the $Z_{j}$ are independent of one another.
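
The standardization step can be illustrated numerically as well: computing $\widetilde{\Sigma}^{-1/2}$ from the formula above, the standardized vector has mean approximately $\mathbf{0}$ and covariance approximately $I_{k-1}$. A minimal sketch, assuming numpy and arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000                                   # arbitrary sample size
p = np.array([0.3, 0.3, 0.4])              # arbitrary probabilities, k = 3
p_t = p[:-1]

# Covariance of the reduced vector and its inverse square root via eigendecomposition
Sigma_t = n * (np.diag(p_t) - np.outer(p_t, p_t))
vals, vecs = np.linalg.eigh(Sigma_t)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T  # Sigma~^{-1/2}

samples = rng.multinomial(n, p, size=50_000)[:, :-1]
Z = (samples - n * p_t) @ W.T              # standardized components

print(np.round(Z.mean(axis=0), 3))         # approximately (0, 0)
print(np.round(np.cov(Z, rowvar=False), 3))  # approximately the identity matrix
```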

Lévy’s continuity theorem: Given a measurable space $\left( \mathbb{R}^{d} , \mathcal{B} \left( \mathbb{R}^{d} \right) \right)$, let’s denote the probability measure for $n \in \overline{\mathbb{N}}$ by $\mu_{n}$, and the corresponding characteristic function by $\varphi_{n}$. The following are equivalent:

  • (a): $\mu_{n}$ weakly converges to $\mu_{\infty}$.
  • (b): For all $t \in \mathbb{R}^{d}$, $$\lim_{n \to \infty} \varphi_{n} (t) = \varphi_{\infty} (t)$$

By the continuous mapping theorem together with Lévy’s continuity theorem, as $n \to \infty$ the characteristic function $\phi$ of $S$ satisfies, for $Z \sim N (0,1)$, $$ \begin{align*} \phi (t) =& E \left[ e^{itS} \right] \\ =& E \left[ \exp \left( it \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)^{T} \widetilde{\Sigma}^{ - 1} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \right) \right] \\ \to & E \left[ \exp \left( it \left( Z^{2}_{1} + \cdots + Z^{2}_{k-1} \right) \right) \right] \\ = & \left[ E \left[ \exp \left( it Z^{2} \right) \right] \right]^{k-1} \end{align*} $$ where the last equality uses the independence of $Z_{1} , \cdots , Z_{k-1}$.

Characteristic function of the chi-squared distribution: If $X \sim \chi^{2} \left( r \right)$, then $$ \varphi_{X} (t) = \left( 1 - 2it \right)^{-r/2} $$ In particular, for $Z \sim N(0,1)$ we have $Z^{2} \sim \chi^{2} (1)$, so $E \left[ \exp \left( it Z^{2} \right) \right] = \left( 1 - 2it \right)^{-1/2}$.

$$ \begin{align*} \phi (t) \to & \left[ E \left[ \exp \left( it Z^{2} \right) \right] \right]^{k-1} \\ =& \left[ {{ 1 } \over { \left( 1 - 2it \right)^{1/2} }} \right]^{k-1} \\ =& (1-2it)^{-(k-1)/2} \end{align*} $$ Since $(1-2it)^{-(k-1)/2}$ is exactly the characteristic function of $\chi^{2} \left( k-1 \right)$, Lévy’s continuity theorem shows that $S$ converges in distribution to the chi-squared distribution $\chi^{2} \left( k-1 \right)$.
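
As a final sanity check, a Monte Carlo estimate of $E \left[ e^{itS} \right]$ can be compared with the limiting characteristic function $(1-2it)^{-(k-1)/2}$; a minimal sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000                                   # arbitrary sample size
p = np.array([0.2, 0.3, 0.5])              # arbitrary probabilities, k = 3
k = len(p)

N = rng.multinomial(n, p, size=200_000)
S = ((N - n * p) ** 2 / (n * p)).sum(axis=1)

for t in [0.1, 0.3, 0.5]:
    mc = np.exp(1j * t * S).mean()                 # Monte Carlo estimate of E[exp(itS)]
    exact = (1 - 2j * t) ** (-(k - 1) / 2)         # characteristic function of chi-squared(k-1)
    print(t, np.round(mc, 3), np.round(exact, 3))
```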


  1. https://ocw.mit.edu/courses/18-443-statistics-for-applications-fall-2003/708680f9de8209158ca6462577a46a56_lec23.pdf ↩︎

  2. Benhamou. (2018). Seven proofs of the Pearson Chi-squared independence test and its graphical interpretation: https://arxiv.org/abs/1808.09171 ↩︎