

Proof of Pearson's Theorem

Theorem

Let’s say that the random vector $\left( N_{1} , \cdots , N_{k} \right)$ follows the multinomial distribution $M_{k} \left( n ; \mathbf{p} \right)$ for some $\mathbf{p} = \left( p_{1} , \cdots , p_{k} \right) \in [0,1]^{k}$ satisfying
$$ \sum_{i=1}^{k} N_{i} = n \qquad \& \qquad \sum_{i=1}^{k} p_{i} = 1 $$
with sample size $n \in \mathbb{N}$ and $k \in \mathbb{N}$ categories. Then, as $n \to \infty$, the statistic $S$ converges in distribution to the chi-squared distribution $\chi^{2} \left( k - 1 \right)$:
$$ S := \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} \overset{D}{\to} \chi^{2} \left( k-1 \right) $$


  • $[0,1]^{k} = [0,1] \times \cdots \times [0,1]$ is a $k$-cell.
  • $\overset{D}{\to}$ means convergence in distribution.
  • $\chi^{2} \left( r \right)$ refers to the chi-squared distribution with $r$ degrees of freedom.
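
The convergence can be checked numerically. Below is a minimal Monte Carlo sketch in Python (assuming numpy is available; the values of $n$, $\mathbf{p}$, and the number of replications are illustrative) comparing the empirical mean and variance of $S$ with those of $\chi^{2}(k-1)$.

```python
# Simulate S = sum_j (N_j - n p_j)^2 / (n p_j) under M_k(n; p) many times and
# compare its empirical mean/variance with those of chi^2(k-1): k-1 and 2(k-1).
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, np.array([0.1, 0.2, 0.3, 0.4])    # k = 4 categories (illustrative values)
k = len(p)

N = rng.multinomial(n, p, size=20000)          # each row is one draw (N_1, ..., N_k)
S = ((N - n * p) ** 2 / (n * p)).sum(axis=1)   # Pearson statistic for each draw

print(S.mean(), k - 1)          # ~ 3, since E[chi^2(3)] = 3
print(S.var(), 2 * (k - 1))     # ~ 6, since Var[chi^2(3)] = 6
```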

Description

Strictly speaking, the term Pearson’s theorem is not often used. Honestly, I’ve only seen it once1, and it’s usually referred to simply as the Pearson chi-squared statistic
$$ \mathcal{X}^{2} := \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} $$
converging to the chi-squared distribution. This is because the conversation often moves on to practical discussions of hypothesis testing rather than studying it as a “theorem” per se, suggesting a tendency to neglect rigorous mathematical proofs.
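
For reference, the statistic in its $O_{j}$/$E_{j}$ form is easy to compute directly. The sketch below uses made-up observed counts under a uniform null and cross-checks the result against scipy.stats.chisquare, which computes the same quantity.

```python
# Pearson chi-squared statistic from observed/expected counts (illustrative data).
import numpy as np
from scipy import stats

observed = np.array([18, 25, 30, 27])              # O_j (made-up counts)
expected = np.full(4, observed.sum() / 4)          # E_j = n p_j under a uniform null

X2 = ((observed - expected) ** 2 / expected).sum()
chi2_stat, p_value = stats.chisquare(observed, expected)
print(X2, chi2_stat)   # identical values
```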

Proof 2

Let’s say that all components of $\mathbf{p}$ satisfy $p_{j} > 0$, and
$$ S := \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} $$
Given the constraint $\sum_{j=1}^{k} \left( N_{j} - n p_{j} \right) = 0$, separating out the last, $k$th term yields
$$ \begin{align*} S =& \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} \\ =& \sum_{j=1}^{k-1} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} + {{ \left( N_{k} - n p_{k} \right)^{2} } \over { n p_{k} }} \\ =& \sum_{j=1}^{k-1} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} + {{ \left( \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) \right)^{2} } \over { n p_{k} }} \end{align*} $$
since $N_{k} - n p_{k} = - \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right)$. In fact, given this constraint, $N_{k}$ is not needed at all, so for $\widetilde{N} := \left( N_{1} , \cdots , N_{k-1} \right)$ and $\widetilde{\mathbf{p}} := \left( p_{1} , \cdots , p_{k-1} \right)$ without the $k$th component, consider the covariance matrix $\widetilde{\Sigma}$ of $\widetilde{N}$.
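
A quick numerical sanity check of this decomposition (with illustrative values of $n$ and $\mathbf{p}$): since $N_{k} - n p_{k} = - \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right)$, the two expressions for $S$ coincide.

```python
# Check that separating the k-th term and replacing N_k - n p_k by
# -sum_{j<k}(N_j - n p_j) leaves S unchanged.
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, np.array([0.2, 0.3, 0.1, 0.4])
N = rng.multinomial(n, p)

d = N - n * p
S_full = (d ** 2 / (n * p)).sum()
S_split = (d[:-1] ** 2 / (n * p[:-1])).sum() + d[:-1].sum() ** 2 / (n * p[-1])
print(np.isclose(S_full, S_split))   # True, since d[-1] = -d[:-1].sum()
```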

Covariance matrix of the multinomial distribution: If the random vector $\mathbf{X} := \left( X_{1} , \cdots , X_{k} \right)$ follows the multinomial distribution $M_{k} \left( n, \mathbf{p} \right)$, then the covariance matrix is as follows:
$$ \operatorname{Cov} \left( \mathbf{X} \right) = n \begin{bmatrix} p_{1} \left( 1 - p_{1} \right) & - p_{1} p_{2} & \cdots & - p_{1} p_{k} \\ - p_{2} p_{1} & p_{2} \left( 1 - p_{2} \right) & \cdots & - p_{2} p_{k} \\ \vdots & \vdots & \ddots & \vdots \\ - p_{k} p_{1} & - p_{k} p_{2} & \cdots & p_{k} \left( 1 - p_{k} \right) \end{bmatrix} $$
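
As a sketch, one can compare the empirical covariance of simulated multinomial draws with the matrix $n \left( \operatorname{diag} (\mathbf{p}) - \mathbf{p} \mathbf{p}^{T} \right)$ above; the sample size and $\mathbf{p}$ below are illustrative.

```python
# Empirical covariance of multinomial draws vs. the stated formula.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, np.array([0.25, 0.25, 0.5])

X = rng.multinomial(n, p, size=100000)
emp_cov = np.cov(X, rowvar=False)
theo_cov = n * (np.diag(p) - np.outer(p, p))
print(np.round(emp_cov, 2))
print(np.round(theo_cov, 2))   # the two matrices should be close
```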

Whether the $k$th component is present or not, there is no reason for $\operatorname{Cov} \left( N_{i} , N_{j} \right)$ to change, so writing $\widetilde{P} := \operatorname{diag} \left( p_{1} , \cdots , p_{k-1} \right)$ for the diagonal matrix with $\widetilde{\mathbf{p}}$ on its diagonal, the following is obtained:
$$ \begin{align*} {{ 1 } \over { n }} \widetilde{\Sigma} =& {{ 1 } \over { n }} \operatorname{Cov} \left( \widetilde{N} \right) \\ =& \begin{bmatrix} p_{1} \left( 1 - p_{1} \right) & - p_{1} p_{2} & \cdots & - p_{1} p_{k-1} \\ - p_{2} p_{1} & p_{2} \left( 1 - p_{2} \right) & \cdots & - p_{2} p_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ - p_{k-1} p_{1} & - p_{k-1} p_{2} & \cdots & p_{k-1} \left( 1 - p_{k-1} \right) \end{bmatrix} \\ =& \begin{bmatrix} p_{1} & 0 & \cdots & 0 \\ 0 & p_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_{k-1} \end{bmatrix} - \begin{bmatrix} p_{1}^{2} & p_{1} p_{2} & \cdots & p_{1} p_{k-1} \\ p_{2} p_{1} & p_{2}^{2} & \cdots & p_{2} p_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k-1} p_{1} & p_{k-1} p_{2} & \cdots & p_{k-1}^{2} \end{bmatrix} \\ =& \begin{bmatrix} p_{1} & 0 & \cdots & 0 \\ 0 & p_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p_{k-1} \end{bmatrix} - \begin{bmatrix} p_{1} \\ p_{2} \\ \vdots \\ p_{k-1} \end{bmatrix} \begin{bmatrix} p_{1} & p_{2} & \cdots & p_{k-1} \end{bmatrix} \\ =& \widetilde{P} - \widetilde{\mathbf{p}} \widetilde{\mathbf{p}}^{T} \end{align*} $$
Here, $\widetilde{\mathbf{p}}^{T}$ denotes the transpose of $\widetilde{\mathbf{p}}$.
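
The identity ${{ 1 } \over { n }} \widetilde{\Sigma} = \widetilde{P} - \widetilde{\mathbf{p}} \widetilde{\mathbf{p}}^{T}$ amounts to dropping the $k$th row and column of the full covariance matrix; a short check with illustrative values:

```python
# (1/n) * Cov(N~) equals diag(p~) - p~ p~^T when the k-th component is dropped.
import numpy as np

p = np.array([0.2, 0.3, 0.1, 0.4])      # illustrative p with k = 4
p_t = p[:-1]                            # p~ without the k-th component
n = 50

full_cov = n * (np.diag(p) - np.outer(p, p))     # covariance of the full vector
reduced = full_cov[:-1, :-1] / n                 # (1/n) Sigma~: drop k-th row/column
print(np.allclose(reduced, np.diag(p_t) - np.outer(p_t, p_t)))   # True
```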

Sherman–Morrison formula: When $\left( A + \mathbf{u} \mathbf{v}^{T} \right)^{-1}$ exists, it is given by the following formula:
$$ \left( A + \mathbf{u} \mathbf{v}^{T} \right)^{-1} = A^{-1} - {{ A^{-1} \mathbf{u} \mathbf{v}^{T} A^{-1} } \over { 1 + \mathbf{v}^{T} A^{-1} \mathbf{u} }} $$
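
A numerical sanity check of the Sherman–Morrison formula with a random, well-conditioned $A$ and random $\mathbf{u}$, $\mathbf{v}$ (illustrative values only):

```python
# Verify the Sherman-Morrison rank-one update formula numerically.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)   # well-conditioned illustrative matrix
u = rng.normal(size=(4, 1))
v = rng.normal(size=(4, 1))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + u @ v.T)
rhs = A_inv - (A_inv @ u @ v.T @ A_inv) / (1 + v.T @ A_inv @ u)
print(np.allclose(lhs, rhs))   # True
```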

Calculating the inverse of ${{ 1 } \over { n }} \widetilde{\Sigma}$: the inverse of the diagonal matrix $\widetilde{P} = \operatorname{diag} \left( p_{1} , \cdots , p_{k-1} \right)$ is the diagonal matrix $\widetilde{P}^{-1} = \operatorname{diag} \left( p_{1}^{-1} , \cdots , p_{k-1}^{-1} \right)$ of reciprocal diagonal components, $\widetilde{P}^{-1} \widetilde{\mathbf{p}} = \mathbf{1}$ is the vector of ones, and $\widetilde{\mathbf{p}}^{T} \widetilde{P}^{-1} \widetilde{\mathbf{p}} = p_{1} + \cdots + p_{k-1}$, so the Sherman–Morrison formula with $A = \widetilde{P}$, $\mathbf{u} = - \widetilde{\mathbf{p}}$, $\mathbf{v} = \widetilde{\mathbf{p}}$ gives
$$ \begin{align*} \left( {{ 1 } \over { n }} \widetilde{\Sigma} \right)^{-1} =& \left( \widetilde{P} + \left( - \widetilde{\mathbf{p}} \right) \widetilde{\mathbf{p}}^{T} \right)^{-1} \\ =& \widetilde{P}^{-1} + {{ \widetilde{P}^{-1} \widetilde{\mathbf{p}} \widetilde{\mathbf{p}}^{T} \widetilde{P}^{-1} } \over { 1 - \widetilde{\mathbf{p}}^{T} \widetilde{P}^{-1} \widetilde{\mathbf{p}} }} \\ =& \widetilde{P}^{-1} + {{ \mathbf{1} \mathbf{1}^{T} } \over { 1 - p_{1} - \cdots - p_{k-1} }} \\ =& \widetilde{P}^{-1} + {{ 1 } \over { p_{k} }} \mathbf{1} \mathbf{1}^{T} \\ =& \begin{bmatrix} {{ 1 } \over { p_{1} }} + {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k} }} \\ {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{2} }} + {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k} }} \\ \vdots & \vdots & \ddots & \vdots \\ {{ 1 } \over { p_{k} }} & {{ 1 } \over { p_{k} }} & \cdots & {{ 1 } \over { p_{k-1} }} + {{ 1 } \over { p_{k} }} \end{bmatrix} \end{align*} $$
where $\mathbf{1} \mathbf{1}^{T}$ is the $(k-1) \times (k-1)$ matrix of ones. Returning to $S$ and expressing it in matrix form,
$$ \begin{align*} S =& \sum_{j=1}^{k-1} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} + {{ \left( \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) \right)^{2} } \over { n p_{k} }} \\ =& {{ 1 } \over { n }} \left[ \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) {{ 1 } \over { p_{j} }} \left( N_{j} - n p_{j} \right) + \sum_{i=1}^{k-1} \left( N_{i} - n p_{i} \right) {{ 1 } \over { p_{k} }} \sum_{j=1}^{k-1} \left( N_{j} - n p_{j} \right) \right] \\ =& {{ 1 } \over { n }} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)^{T} \left( {{ 1 } \over { n }} \widetilde{\Sigma} \right)^{-1} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \\ =& \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)^{T} \widetilde{\Sigma}^{-1} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \end{align*} $$
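
Both the closed form of the inverse and the quadratic-form expression of $S$ can be verified numerically; the sketch below uses illustrative $n$ and $\mathbf{p}$.

```python
# Verify (diag(p~) - p~ p~^T)^{-1} = P~^{-1} + (1/p_k) 1 1^T and the
# quadratic-form expression of the Pearson statistic S.
import numpy as np

rng = np.random.default_rng(4)
n, p = 300, np.array([0.2, 0.3, 0.1, 0.4])
p_t, p_k = p[:-1], p[-1]
k1 = len(p_t)

Sigma_n = np.diag(p_t) - np.outer(p_t, p_t)                # (1/n) Sigma~
inv_closed = np.diag(1 / p_t) + np.ones((k1, k1)) / p_k    # P~^{-1} + (1/p_k) 1 1^T
print(np.allclose(np.linalg.inv(Sigma_n), inv_closed))     # True

N = rng.multinomial(n, p)
S_sum = ((N - n * p) ** 2 / (n * p)).sum()
d = N[:-1] - n * p_t
S_quad = d @ np.linalg.inv(n * Sigma_n) @ d                # (N~-np~)^T Sigma~^{-1} (N~-np~)
print(np.isclose(S_sum, S_quad))                           # True
```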

De Moivre–Laplace theorem: If $X_i \sim B(1,p)$ and $Y_n = X_1 + X_2 + \cdots + X_n$, then $Y_n \sim B(n,p)$, and
$$ {{ Y_n - np } \over {\sqrt{ np(1-p) } } } \overset{D}{\to} N(0,1) $$
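
A minimal illustration of the cited theorem (with arbitrary $n$ and $p$): standardized binomial draws have mean and standard deviation close to $0$ and $1$.

```python
# Standardized Binomial(n, p) draws behave like N(0, 1) for large n.
import numpy as np

rng = np.random.default_rng(5)
n, p = 2000, 0.3
Y = rng.binomial(n, p, size=100000)
Z = (Y - n * p) / np.sqrt(n * p * (1 - p))
print(Z.mean(), Z.std())   # ~ 0 and ~ 1
```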

Since we already know that $\widetilde{\Sigma}$ is the covariance matrix of $\widetilde{N}$, the vector $\left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)$ has mean $\mathbf{0}$ and covariance $\widetilde{\Sigma}$, and according to the De Moivre–Laplace theorem (more precisely, its multivariate generalization, the multivariate central limit theorem), for the multivariate normal distribution $\mathcal{N}_{k-1}$ it can be expressed as follows:
$$ \widetilde{\Sigma}^{ - 1/2} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \overset{D}{\to} \mathcal{N}_{k-1} \left( \mathbf{0} , I_{k-1} \right) $$
At this point, each component, which converges in distribution to the standard normal distribution, is denoted by $Z_{j} \sim N \left( 0, 1 \right)$, and let the characteristic function of $S$ be $\phi$. Since the covariance matrix of $\left( Z_{1} , \cdots , Z_{k-1} \right)$ is $I_{k-1}$ and the $Z_{j}$ are jointly normal, $Z_{i}$ and $Z_{j}$ are of course independent when $i \ne j$.
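
As a sketch of this standardization step (illustrative $n$ and $\mathbf{p}$), the empirical covariance of $\widetilde{\Sigma}^{-1/2} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)$ is close to the identity matrix, so its components are approximately uncorrelated standard normals.

```python
# Empirical covariance of the standardized reduced vector is close to I_{k-1}.
import numpy as np

rng = np.random.default_rng(6)
n, p = 2000, np.array([0.2, 0.3, 0.1, 0.4])
p_t = p[:-1]

Sigma_t = n * (np.diag(p_t) - np.outer(p_t, p_t))           # covariance of N~
vals, vecs = np.linalg.eigh(Sigma_t)
Sigma_inv_half = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # Sigma~^{-1/2}

N = rng.multinomial(n, p, size=50000)[:, :-1]               # keep the first k-1 counts
Z = (N - n * p_t) @ Sigma_inv_half                          # standardized vectors (rows)
print(np.round(np.cov(Z, rowvar=False), 2))                 # ~ identity matrix
```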

Lévy’s continuity theorem: Given a measurable space $\left( \mathbb{R}^{d} , \mathcal{B} \left( \mathbb{R}^{d} \right) \right)$, denote the probability measures for $n \in \overline{\mathbb{N}}$ by $\mu_{n}$, and the corresponding characteristic functions by $\varphi_{n}$. The following are equivalent:

  • (a): $\mu_{n}$ converges weakly to $\mu_{\infty}$.
  • (b): For all $t \in \mathbb{R}^{d}$, $\lim_{n \to \infty} \varphi_{n} (t) = \varphi_{\infty} (t)$.

As $n \to \infty$, according to Lévy’s continuity theorem, for $Z \sim N (0,1)$ the characteristic function $\phi$ satisfies
$$ \begin{align*} \phi (t) =& E \left[ e^{itS} \right] \\ =& E \left[ \exp \left( it \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right)^{T} \widetilde{\Sigma}^{ - 1} \left( \widetilde{N} - n \widetilde{\mathbf{p}} \right) \right) \right] \\ \to & E \left[ \exp \left( it \left( Z^{2}_{1} + \cdots + Z^{2}_{k-1} \right) \right) \right] \\ = & \left[ E \left[ \exp \left( it Z^{2} \right) \right] \right]^{k-1} \end{align*} $$
where the last equality uses the independence of $Z_{1} , \cdots , Z_{k-1}$.

Properties of the chi-squared distribution: If $Z \sim N(0,1)$, then $Z^{2} \sim \chi^{2} (1)$, and the characteristic function of the chi-squared distribution $\chi^{2} (r)$ is
$$ \varphi (t) = \left( 1 - 2it \right)^{-r/2} $$

Hence, as $n \to \infty$,
$$ \begin{align*} \phi (t) \to & \left[ E \left[ \exp \left( it Z^{2} \right) \right] \right]^{k-1} \\ =& \left[ {{ 1 } \over { \left( 1 - 2it \right)^{1/2} }} \right]^{k-1} \\ =& (1-2it)^{-(k-1)/2} \end{align*} $$
which is exactly the characteristic function of $\chi^{2} \left( k-1 \right)$. Therefore, by Lévy’s continuity theorem, $S$ converges in distribution to the chi-squared distribution $\chi^{2} \left( k-1 \right)$.
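
Finally, the characteristic-function identity itself can be checked by Monte Carlo (illustrative $n$, $\mathbf{p}$, and evaluation point $t$): the empirical characteristic function of $S$ is close to $(1-2it)^{-(k-1)/2}$.

```python
# Empirical characteristic function of S vs. that of chi^2(k-1).
import numpy as np

rng = np.random.default_rng(7)
n, p = 2000, np.array([0.1, 0.2, 0.3, 0.4])
k = len(p)

N = rng.multinomial(n, p, size=100000)
S = ((N - n * p) ** 2 / (n * p)).sum(axis=1)

t = 0.3                                          # an arbitrary evaluation point
empirical = np.exp(1j * t * S).mean()
theoretical = (1 - 2j * t) ** (-(k - 1) / 2)
print(empirical, theoretical)                    # the two complex numbers are close
```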


  1. https://ocw.mit.edu/courses/18-443-statistics-for-applications-fall-2003/708680f9de8209158ca6462577a46a56_lec23.pdf ↩︎

  2. Benhamou. (2018). Seven proofs of the Pearson Chi-squared independence test and its graphical interpretation: https://arxiv.org/abs/1808.09171 ↩︎