Let’s say that the random vector $(N_1, \cdots, N_k)$ follows the multinomial distribution $M_k(n; \mathbf{p})$ for $\mathbf{p} = (p_1, \cdots, p_k) \in [0,1]^k$ satisfying $\sum_{i=1}^{k} N_i = n$ and $\sum_{i=1}^{k} p_i = 1$, with sample size $n \in \mathbb{N}$ and $k \in \mathbb{N}$ categories. Then, as $n \to \infty$, the statistic $S$ converges in distribution to the chi-squared distribution $\chi^2(k-1)$.
$$
S := \sum_{j=1}^{k} \frac{(N_j - np_j)^2}{np_j} \overset{D}{\to} \chi^2(k-1)
$$
Here $\chi^2(r)$ refers to the chi-squared distribution with $r$ degrees of freedom.
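As a quick sanity check of the statement, here is a small simulation sketch of my own (the choices of $\mathbf{p}$, $n$, and the number of repetitions are arbitrary): it draws multinomial counts with numpy and compares a few empirical quantiles of $S$ with the corresponding $\chi^2(k-1)$ quantiles from scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = np.array([0.2, 0.3, 0.1, 0.4])   # k = 4 categories, chosen arbitrarily
k, n, reps = len(p), 5_000, 20_000

counts = rng.multinomial(n, p, size=reps)          # reps draws of (N_1, ..., N_k)
S = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)  # Pearson statistic for each draw

# Compare a few empirical quantiles of S with the chi-squared(k-1) quantiles.
for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(S, q), stats.chi2.ppf(q, df=k - 1))
```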
Description
Strictly speaking, the term Pearson’s theorem is not used very often. Honestly, I’ve only seen it once1, and the result is usually referred to simply as the Pearson Chi-squared Statistic $X^2 := \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$ converging to the chi-squared distribution. This is because the discussion usually moves straight on to the practicalities of hypothesis testing rather than treating the result as a “theorem” in its own right, so the rigorous mathematical proof tends to be neglected.
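For a concrete feel of how the statistic is used in practice, here is a small illustration of my own; the observed counts and the null probabilities below are made up, and scipy.stats.chisquare is used only as a cross-check of the hand-computed $X^2$.

```python
import numpy as np
from scipy import stats

observed = np.array([18, 30, 12, 40])     # hypothetical observed counts O_j
p0 = np.array([0.2, 0.3, 0.1, 0.4])       # null-hypothesis cell probabilities
expected = observed.sum() * p0            # expected counts E_j = n p_j

X2 = ((observed - expected) ** 2 / expected).sum()
stat, pvalue = stats.chisquare(f_obs=observed, f_exp=expected)
print(X2, stat, pvalue)                   # X2 and stat coincide
```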
Let’s say that every component of $\mathbf{p}$ satisfies $p_j > 0$, and write
$$
S := \sum_{j=1}^{k} \frac{(N_j - np_j)^2}{np_j}
$$
Given the constraint $\sum_{j=1}^{k} (N_j - np_j) = 0$, separating out the last, $k$th term yields
$$
\begin{aligned}
S &= \sum_{j=1}^{k} \frac{(N_j - np_j)^2}{np_j} \\
&= \sum_{j=1}^{k-1} \frac{(N_j - np_j)^2}{np_j} + \frac{(N_k - np_k)^2}{np_k} \\
&= \sum_{j=1}^{k-1} \frac{(N_j - np_j)^2}{np_j} + \frac{\left( \sum_{j=1}^{k-1} (N_j - np_j) \right)^2}{np_k}
\end{aligned}
$$
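This algebra is easy to check numerically. The snippet below is my own, with an arbitrary $\mathbf{p}$ and $n$; it confirms that the full sum and the split form agree for a single simulated draw.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.25, 0.25, 0.3, 0.2])
n = 1_000
N = rng.multinomial(n, p)                 # one draw of (N_1, ..., N_k)

full = ((N - n * p) ** 2 / (n * p)).sum()
d = (N - n * p)[:-1]                      # deviations of the first k-1 cells
split = (d ** 2 / (n * p[:-1])).sum() + d.sum() ** 2 / (n * p[-1])
print(np.isclose(full, split))            # True
```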
In fact, given this constraint, $N_k$ is redundant, so consider the vectors $\mathbf{N} := (N_1, \cdots, N_{k-1})$ and $\mathbf{p} := (p_1, \cdots, p_{k-1})$ without the $k$th component, and let $\Sigma$ be the covariance matrix of $\mathbf{N}$.
Covariance matrix of the multinomial distribution: If the random vector $X := (X_1, \cdots, X_k)$ follows the multinomial distribution $M_k(n, \mathbf{p})$, then its covariance matrix is as follows:
$$
\operatorname{Cov}(X) = n \begin{bmatrix}
p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\
-p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_k \\
\vdots & \vdots & \ddots & \vdots \\
-p_k p_1 & -p_k p_2 & \cdots & p_k(1-p_k)
\end{bmatrix}
$$
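This formula can also be checked empirically. The sketch below (an illustration of mine, not part of the original argument) compares the sample covariance of simulated multinomial draws with the matrix above, built here as $n(\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T)$, which has exactly the entries $np_i(\delta_{ij} - p_j)$.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.2, 0.5, 0.3])
n, reps = 50, 200_000

X = rng.multinomial(n, p, size=reps)
empirical = np.cov(X, rowvar=False)               # sample covariance of the draws
theoretical = n * (np.diag(p) - np.outer(p, p))   # entries n p_i (delta_ij - p_j)
print(np.abs(empirical - theoretical).max())      # small, and shrinks as reps grows
```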
Whether or not the $k$th component is included, there is no reason for $\operatorname{Cov}(N_i, N_j)$ to change, so the following is obtained:
$$
\begin{aligned}
\frac{1}{n}\Sigma &= \frac{1}{n}\operatorname{Cov}(\mathbf{N}) \\
&= \begin{bmatrix}
p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\
-p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_{k-1} \\
\vdots & \vdots & \ddots & \vdots \\
-p_{k-1} p_1 & -p_{k-1} p_2 & \cdots & p_{k-1}(1-p_{k-1})
\end{bmatrix} \\
&= \begin{bmatrix}
p_1 & 0 & \cdots & 0 \\
0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{k-1}
\end{bmatrix} - \begin{bmatrix}
p_1^2 & p_1 p_2 & \cdots & p_1 p_{k-1} \\
p_2 p_1 & p_2^2 & \cdots & p_2 p_{k-1} \\
\vdots & \vdots & \ddots & \vdots \\
p_{k-1} p_1 & p_{k-1} p_2 & \cdots & p_{k-1}^2
\end{bmatrix} \\
&= \begin{bmatrix}
p_1 & 0 & \cdots & 0 \\
0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{k-1}
\end{bmatrix} - \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{k-1} \end{bmatrix} \begin{bmatrix} p_1 & p_2 & \cdots & p_{k-1} \end{bmatrix} \\
&= \operatorname{diag}(p_1, \cdots, p_{k-1}) - \mathbf{p}\mathbf{p}^T
\end{aligned}
$$
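The rank-one decomposition at the end is immediate, but here is a tiny numerical confirmation of my own that the leading $(k-1)\times(k-1)$ block of the full covariance formula equals $\operatorname{diag}(p_1, \cdots, p_{k-1}) - \mathbf{p}\mathbf{p}^T$; the probability vector is an arbitrary example.

```python
import numpy as np

p_full = np.array([0.2, 0.5, 0.3])                      # arbitrary example with k = 3
p = p_full[:-1]                                         # p without the k-th component

# Leading (k-1)x(k-1) block of (1/n) Cov(N_1, ..., N_k) from the formula above
block = (np.diag(p_full) - np.outer(p_full, p_full))[:-1, :-1]
decomposition = np.diag(p) - np.outer(p, p)             # diag(p) - p p^T
print(np.allclose(block, decomposition))                # True
```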
Here, $\mathbf{p}^T$ denotes the transpose of $\mathbf{p}$. Now set $P := \operatorname{diag}(p_1, \cdots, p_{k-1})$.
Sherman–Morrison formula: When $(A + \mathbf{u}\mathbf{v}^T)^{-1}$ exists, it is given explicitly as follows:
$$
(A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^T A^{-1}}{1 + \mathbf{v}^T A^{-1}\mathbf{u}}
$$
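A standalone numerical sanity check of this identity, with a deliberately simple diagonal $A$ and hand-picked vectors $\mathbf{u}$, $\mathbf{v}$ (all of my choosing), looks like this:

```python
import numpy as np

A = np.diag([1.0, 2.0, 3.0, 4.0])                       # simple invertible A
u = np.array([[1.0], [0.5], [-0.2], [0.3]])             # column vectors u, v
v = np.array([[0.1], [0.2], [0.3], [0.4]])

lhs = np.linalg.inv(A + u @ v.T)
Ainv = np.linalg.inv(A)
rhs = Ainv - (Ainv @ u @ v.T @ Ainv) / (1 + (v.T @ Ainv @ u).item())
print(np.allclose(lhs, rhs))                            # True
```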
To compute the inverse of $\frac{1}{n}\Sigma$, note that the inverse of the diagonal matrix
$$
P = \operatorname{diag}(p_1, \cdots, p_{k-1})
$$
is the diagonal matrix $P^{-1} = \operatorname{diag}(p_1^{-1}, \cdots, p_{k-1}^{-1})$ of reciprocal diagonal entries. Writing $\mathbf{1}_{k-1}$ for the $(k-1)$-dimensional vector of ones, so that $P^{-1}\mathbf{p} = \mathbf{1}_{k-1}$ and $\mathbf{p}^T P^{-1}\mathbf{p} = p_1 + \cdots + p_{k-1}$, the Sherman–Morrison formula gives
$$
\begin{aligned}
\left( \frac{1}{n}\Sigma \right)^{-1} &= \left( P + (-\mathbf{p})\mathbf{p}^T \right)^{-1} \\
&= P^{-1} + \frac{P^{-1}\mathbf{p}\mathbf{p}^T P^{-1}}{1 - \mathbf{p}^T P^{-1}\mathbf{p}} \\
&= P^{-1} + \frac{\mathbf{1}_{k-1}\mathbf{1}_{k-1}^T}{1 - p_1 - \cdots - p_{k-1}} \\
&= P^{-1} + \frac{1}{p_k}\mathbf{1}_{k-1}\mathbf{1}_{k-1}^T \\
&= \begin{bmatrix}
\frac{1}{p_1}+\frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\
\frac{1}{p_k} & \frac{1}{p_2}+\frac{1}{p_k} & \cdots & \frac{1}{p_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_{k-1}}+\frac{1}{p_k}
\end{bmatrix}
\end{aligned}
$$
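The resulting inverse is easy to verify directly. The snippet below (illustrative values only, my own addition) inverts $P - \mathbf{p}\mathbf{p}^T$ numerically and compares it with $P^{-1} + \frac{1}{p_k}\mathbf{1}_{k-1}\mathbf{1}_{k-1}^T$.

```python
import numpy as np

p_full = np.array([0.1, 0.2, 0.3, 0.4])                 # k = 4, so p_k = 0.4
p, pk = p_full[:-1], p_full[-1]

P = np.diag(p)
direct = np.linalg.inv(P - np.outer(p, p))              # invert P - p p^T directly
formula = np.linalg.inv(P) + np.ones((len(p), len(p))) / pk
print(np.allclose(direct, formula))                     # True
```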
Returning to S and expressing it in matrix form,
$$
\begin{aligned}
S &= \sum_{j=1}^{k-1} \frac{(N_j - np_j)^2}{np_j} + \frac{\left( \sum_{j=1}^{k-1} (N_j - np_j) \right)^2}{np_k} \\
&= \frac{1}{n}\left[ \sum_{j=1}^{k-1} (N_j - np_j)\frac{1}{p_j}(N_j - np_j) + \sum_{j=1}^{k-1} (N_j - np_j)\frac{1}{p_k}\sum_{j=1}^{k-1} (N_j - np_j) \right] \\
&= \frac{1}{n}(\mathbf{N} - n\mathbf{p})^T \left( \frac{1}{n}\Sigma \right)^{-1} (\mathbf{N} - n\mathbf{p}) \\
&= (\mathbf{N} - n\mathbf{p})^T \Sigma^{-1} (\mathbf{N} - n\mathbf{p})
\end{aligned}
$$
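As a cross-check of this matrix form (again my own addition, with arbitrary parameters), the following compares the quadratic form in the reduced vector with the original $k$-term sum for a single simulated draw.

```python
import numpy as np

rng = np.random.default_rng(4)
p_full = np.array([0.1, 0.2, 0.3, 0.4])
n = 2_000
N_full = rng.multinomial(n, p_full)

S_sum = ((N_full - n * p_full) ** 2 / (n * p_full)).sum()

p, N = p_full[:-1], N_full[:-1]                         # drop the k-th component
Sigma = n * (np.diag(p) - np.outer(p, p))               # covariance of the reduced N
d = N - n * p
S_quad = d @ np.linalg.inv(Sigma) @ d
print(np.isclose(S_sum, S_quad))                        # True
```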
De Moivre–Laplace theorem: If $X_i \sim B(1, p)$ and $Y_n = X_1 + X_2 + \cdots + X_n$, then $Y_n \sim B(n, p)$, and
$$
\frac{Y_n - np}{\sqrt{np(1-p)}} \overset{D}{\to} N(0,1)
$$
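A quick simulation of this statement (with an arbitrary $n$ and $p$ of my choosing) compares the empirical distribution of the standardized binomial with the standard normal CDF at a few points.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, prob, reps = 10_000, 0.3, 100_000

Y = rng.binomial(n, prob, size=reps)
Z = (Y - n * prob) / np.sqrt(n * prob * (1 - prob))     # standardized binomial

for t in (-1.0, 0.0, 1.5):
    print(t, (Z <= t).mean(), stats.norm.cdf(t))        # empirical CDF vs N(0,1) CDF
```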
Since we already know that Σ is the covariance matrix of N,
$$
(\mathbf{N} - n\mathbf{p}) \sim (\mathbf{0}, \Sigma)
$$
and, according to the De Moivre–Laplace theorem extended to the multivariate case, this can be expressed in terms of the multivariate normal distribution $N_{k-1}$ as follows:
$$
\Sigma^{-1/2}(\mathbf{N} - n\mathbf{p}) \overset{D}{\to} N_{k-1}(\mathbf{0}, I_{k-1})
$$
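The full convergence takes more to verify, but the first two moments are easy to check by simulation. The sketch below (my own, with arbitrary parameters) confirms that $\Sigma^{-1/2}(\mathbf{N} - n\mathbf{p})$ has roughly zero mean and identity covariance for large $n$.

```python
import numpy as np

rng = np.random.default_rng(6)
p_full = np.array([0.1, 0.2, 0.3, 0.4])
p = p_full[:-1]
n, reps = 5_000, 50_000

counts = rng.multinomial(n, p_full, size=reps)[:, :-1]  # reduced vectors N
Sigma = n * (np.diag(p) - np.outer(p, p))

# Symmetric inverse square root of Sigma via its eigendecomposition
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(1 / np.sqrt(vals)) @ vecs.T
W = (counts - n * p) @ Sigma_inv_sqrt                   # rows are Sigma^{-1/2}(N - np)

print(np.round(W.mean(axis=0), 3))                      # approximately 0
print(np.round(np.cov(W, rowvar=False), 3))             # approximately I_{k-1}
```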
At this point, denote by $Z_j \sim N(0,1)$ the components, each of which converges in distribution to the standard normal distribution, and let $\phi$ be the characteristic function of $S$. Since the covariance matrix of $(Z_1, \cdots, Z_{k-1})$ is $I_{k-1}$, $Z_i$ and $Z_j$ are of course independent when $i \neq j$.
Lévy’s continuity theorem: Given a measurable space $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, let $\mu_n$ denote the probability measures for $n \in \mathbb{N} \cup \{\infty\}$, and let $\varphi_n$ denote the corresponding characteristic functions. The following are equivalent:
(a): μn weakly converges to μ∞.
(b): For all t∈Rd,
$$
\lim_{n \to \infty} \varphi_n(t) = \varphi_\infty(t)
$$
As $n \to \infty$, by Lévy’s continuity theorem, the characteristic function $\phi$ satisfies the following for $Z \sim N(0,1)$:
$$
\begin{aligned}
\phi(t) &= E\left[ e^{itS} \right] \\
&= E\left[ \exp\left( it(\mathbf{N} - n\mathbf{p})^T \Sigma^{-1} (\mathbf{N} - n\mathbf{p}) \right) \right] \\
&\overset{D}{\to} E\left[ \exp\left( it\left( Z_1^2 + \cdots + Z_{k-1}^2 \right) \right) \right] \\
&= \Big[ E\left[ \exp\left( itZ^2 \right) \right] \Big]^{k-1}
\end{aligned}
$$
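To see numerically where this is heading (my own check, not part of the article’s derivation), one can estimate $E[e^{itS}]$ by Monte Carlo for large $n$ and compare it with $(1 - 2it)^{-(k-1)/2}$, which is the characteristic function of $\chi^2(k-1)$; the parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.2, 0.3, 0.1, 0.4])
k, n, reps = len(p), 5_000, 100_000

counts = rng.multinomial(n, p, size=reps)
S = ((counts - n * p) ** 2 / (n * p)).sum(axis=1)

for t in (0.1, 0.25):
    phi_hat = np.exp(1j * t * S).mean()                 # Monte Carlo estimate of E[exp(itS)]
    phi_chi2 = (1 - 2j * t) ** (-(k - 1) / 2)           # chi-squared(k-1) characteristic function
    print(t, phi_hat, phi_chi2)                         # close for large n
```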