Pearson Chi-Square test statistic

Definition ¹

Consider a multinomial experiment with $k$ categories, where each trial falls into the $j$-th category with probability $p_{j} > 0$, and suppose we obtain categorical data from $n$ independent trials. The number of observations falling into the $j$-th category, $O_{j}$, is called the observed cell count, and its expected value under the null hypothesis, $E_{j}$, is called the expected cell count (for example, $E_{j} = n p_{j}$ when the null hypothesis specifies the probabilities $p_{j}$). The test statistic $$ \mathcal{X}^{2} := \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} $$ is called the Pearson chi-square test statistic.
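As a minimal sketch of the definition, the statistic can be computed directly from the observed counts, assuming the null hypothesis fully specifies the probabilities so that $E_{j} = n p_{j}$. The function name `pearson_chi_square` and the die-roll counts are made up for illustration and do not come from the reference.

```python
from typing import Sequence


def pearson_chi_square(observed: Sequence[int], null_probs: Sequence[float]) -> float:
    """Return sum_j (O_j - E_j)^2 / E_j with E_j = n * p_j."""
    if len(observed) != len(null_probs):
        raise ValueError("observed and null_probs must have the same length")
    if any(p <= 0 for p in null_probs):
        raise ValueError("every null probability must be positive")
    n = sum(observed)  # total number of independent trials
    expected = [n * p for p in null_probs]  # expected cell counts E_j
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, null hypothesis "the die is fair".
rolls = [8, 12, 9, 11, 10, 10]
fair = [1 / 6] * 6
print(pearson_chi_square(rolls, fair))
```

Every expected count here is $10$, so the statistic is $(4 + 4 + 1 + 1 + 0 + 0) / 10$, a small value that reflects data close to what a fair die would produce.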

Explanation

Hypothesis Testing

$\mathcal{X}^{2}$ is one of the first test statistics that freshmen encounter, and it often strikes fear and awe in those who are familiar only with the normal or binomial distribution. Without some intuition for data and statistical analysis, the chi-square distribution is hard to grasp, so we begin with a simplified explanation based on the formula itself.

In most cases, a large $\mathcal{X}^{2}$ indicates a discrepancy between the actual data and the theoretical expectation. Looking at the numerator, $\left( O_{j} - E_{j} \right)^{2} \ge 0$ attains its minimum of $0$ exactly when $O_{j} = E_{j}$, i.e., when the observed data match the counts implied by the theoretical probability $p_{j}$. The further apart these values are, the larger the numerator becomes.
Consequently, the more the data deviate from the null hypothesis $H_{0}$, the larger $\mathcal{X}^{2}$ becomes, so the test is right-tailed: the null hypothesis is rejected at significance level $\alpha$ when $\mathcal{X}^{2}$ exceeds $\chi^{2}_{1-\alpha}$, the $(1-\alpha)$ quantile of the appropriate chi-square distribution.
In simple terms, a large $\mathcal{X}^{2}$ signifies “something is very wrong,” and the chi-square distribution is the yardstick for deciding how large a deviation is too large.
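The right-tailed decision rule can be sketched in a few lines. To stay within the standard library, this example assumes exactly $k = 3$ categories, so the reference distribution is $\chi^{2}(2)$ (the degrees of freedom $k - 1$ come from Pearson's theorem discussed later in this article). For two degrees of freedom the upper-tail probability is $e^{-x/2}$, so the critical value is $-2 \ln \alpha$. The function names and the counts are hypothetical.

```python
import math


def pearson_chi_square(observed, null_probs):
    """Return sum_j (O_j - E_j)^2 / E_j with E_j = n * p_j."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, null_probs))


def right_tailed_test_three_categories(observed, null_probs, alpha=0.05):
    """Right-tailed test, valid only for k = 3 categories (2 degrees of freedom)."""
    if len(observed) != 3 or len(null_probs) != 3:
        raise ValueError("this sketch only covers k = 3 categories")
    statistic = pearson_chi_square(observed, null_probs)
    critical = -2.0 * math.log(alpha)      # (1 - alpha) quantile of chi-square(2)
    p_value = math.exp(-statistic / 2.0)   # P(X > statistic) for X ~ chi-square(2)
    return statistic, critical, p_value, statistic > critical


null_probs = [0.25, 0.5, 0.25]  # hypothetical null hypothesis
for counts in ([30, 50, 20], [40, 45, 15]):  # hypothetical observed counts
    stat, crit, p, reject = right_tailed_test_three_categories(counts, null_probs)
    print(counts, f"X2={stat:.2f}", f"critical={crit:.3f}", f"p={p:.4f}", "reject" if reject else "keep H0")
```

The first set of counts stays below the critical value and the second lands well above it, which is the "something is very wrong" case. For other numbers of categories, the critical value would have to come from a chi-square table or a statistics library.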

The Pearson chi-square test statistic for categorical data is typically used for goodness-of-fit tests, tests of independence, and tests of homogeneity.

Theoretical Basis

If you’re reading further, you’re likely beyond the freshman level.

By Student’s theorem, the suitably scaled sum of squared residuals of a normally distributed sample follows a chi-square distribution. Even so, to undergraduates familiar with mathematical statistics, the structure of $\mathcal{X}^{2}$ may look awkward. It seems plausible at first glance, but because nothing assumes that the deviations are normally distributed, it can look like a purely empirical statistic. Statistics is not built so haphazardly, of course: Pearson’s theorem, which has a rigorous proof, guarantees that $\mathcal{X}^{2}$ is asymptotically chi-square distributed.

Pearson’s Theorem: Given a sample size $n \in \mathbb{N}$ and $k \in \mathbb{N}$ categories, let the random vector $\left( N_{1} , \cdots , N_{k} \right)$ follow the multinomial distribution $M_{k} \left( n ; \mathbf{p} \right)$ with $\mathbf{p} = \left( p_{1} , \cdots , p_{k} \right)$. Then, as $n \to \infty$, the statistic $S$ below converges in distribution to the chi-square distribution $\chi^{2} \left( k - 1 \right)$. $$ S = \sum_{j=1}^{k} {{ \left( N_{j} - n p_{j} \right)^{2} } \over { n p_{j} }} \overset{D}{\to} \chi^{2} \left( k-1 \right) $$

The multinomial experiment in the definition amounts to assuming that our data follow a multinomial distribution. By Pearson’s theorem, if the sample is sufficiently large, $\mathcal{X}^{2}$ approximately follows a chi-square distribution with $(k-1)$ degrees of freedom, one less than the number of categories $k$. The proof of Pearson’s theorem is not easy, but undergraduates can use $\mathcal{X}^{2}$ effectively without knowing the full theory. Those who go on to graduate study, however, are encouraged to take the time to work through the proof themselves.
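Short of a proof, a small Monte Carlo experiment can make the theorem plausible. The sketch below repeatedly draws multinomial samples, computes $S$, and compares two summaries against what $\chi^{2}(k-1)$ predicts: the mean $k - 1$, and the share of values above the 5% critical value. It assumes $k = 3$ so that the critical value has the closed form $-2 \ln \alpha$. The probabilities, sample size, and number of replications are arbitrary illustrative choices.

```python
import math
import random


def simulate_statistic(n, probs, rng):
    """Draw one multinomial sample of size n and return Pearson's S."""
    k = len(probs)
    draws = rng.choices(range(k), weights=probs, k=n)  # n independent trials
    counts = [0] * k
    for category in draws:
        counts[category] += 1
    return sum((counts[j] - n * probs[j]) ** 2 / (n * probs[j]) for j in range(k))


def run_simulation(n=200, probs=(0.2, 0.3, 0.5), reps=2000, alpha=0.05, seed=0):
    """Return the sample mean of S and the share of S above the critical value."""
    if len(probs) != 3:
        raise ValueError("the closed-form critical value below needs k = 3")
    rng = random.Random(seed)
    values = [simulate_statistic(n, probs, rng) for _ in range(reps)]
    mean = sum(values) / reps
    critical = -2.0 * math.log(alpha)  # (1 - alpha) quantile of chi-square(2)
    rejection_rate = sum(v > critical for v in values) / reps
    return mean, rejection_rate


mean, rejection_rate = run_simulation()
print(f"mean of S: {mean:.3f} (chi-square(2) has mean 2)")
print(f"share of S above the 5% critical value: {rejection_rate:.3f}")
```

If the approximation is working, the mean should land near $2$ and the share near $0.05$. Rerunning with a much smaller `n` gives a feel for how the quality of the approximation depends on sample size.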

Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p596.