Goodness of Fit Test
Hypothesis Testing [1]
Suppose categorical data are obtained from $n$ independent trials of a multinomial experiment in which each of $k$ categories is theoretically sampled with probability $p_{j} > 0$. The following hypothesis test, based on the Pearson Chi-square test statistic, is called a Goodness of Fit Test.
- $H_{0}$: The given data has been sampled in accordance with theoretical probabilities.
- $H_{1}$: The given data has not been sampled in accordance with theoretical probabilities.
Test Statistic
The test statistic is the Pearson Chi-square statistic: $$ \mathcal{X}^{2} = \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} $$ Here, the frequency $O_{j}$ of data belonging to the $j$th category is called the observed frequency, and $E_{j} := n p_{j}$, the product of the sample size and the probability $p_{j}$, is called the expected frequency. Under the null hypothesis, this test statistic follows a chi-square distribution with $(k-1)$ degrees of freedom.
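To make the definitions concrete, here is a minimal sketch in Python of how $\mathcal{X}^{2}$ and the right-tail critical value could be computed; the observed counts and probabilities below are made up for illustration, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical observed frequencies O_j and theoretical probabilities p_j (illustrative only)
observed = np.array([18, 55, 27])     # O_j for j = 1, ..., k
p = np.array([0.2, 0.5, 0.3])         # theoretical p_j, summing to 1
n = observed.sum()                    # total number of independent trials
expected = n * p                      # expected frequencies E_j = n * p_j

# Pearson chi-square statistic: sum over categories of (O_j - E_j)^2 / E_j
statistic = ((observed - expected) ** 2 / expected).sum()

# Under H0 the statistic follows a chi-square distribution with k - 1 degrees of freedom,
# so the test rejects H0 when the statistic exceeds the right-tail critical value
df = len(observed) - 1
critical_value = chi2.ppf(0.95, df)   # critical value at significance level alpha = 0.05

print(statistic, critical_value)      # reject H0 if statistic > critical_value
```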
Explanation
Goodness of Fit?
First, the term Goodness of Fit Test itself deserves a comment. In data science, the word fit is used somewhat differently from its everyday sense. In Korean, expressions like ‘fit it’ or ‘after fitting’ sound unnatural, because the chosen translation 적합 is not habitually used as a verb, yet such phrases pop up constantly in this field; in English, by contrast, Fit works perfectly well as a verb. Using the gerund Fitting instead, as in ‘do the fitting’ or ‘after the fitting’, reads much more naturally.
So what is this fitting? When the term fitting is used, it aligns almost exactly with its everyday meaning. For a model, whose job is to wear clothes, saying the fit is good means the clothes suit the model well. Likewise, in data science, the measure of how well a theoretical model matches our data is called the Goodness of Fit.
In summary, the degree to which the data match the theory is called the goodness of fit, and a Goodness of Fit Test is a statistical test of this measure.
Applications
The applications of goodness of fit tests are endless: distributions of blood types, MBTI, gender ratios, sectors of employment, and so on. Because the test presupposes a multinomial experiment, it may appear usable only for categorical data, but in practice quantitative data can be converted into categorical data by binning, so the test can be applied to any theoretically specified distribution, as in the sketch below.
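As a rough sketch of that idea, the following Python snippet bins quantitative data into $k$ equally probable categories under a claimed distribution (here, hypothetically, a standard normal) and then applies the chi-square goodness of fit test via scipy.stats.chisquare; the sample, bin count, and variable names are all illustrative assumptions rather than anything from the text.

```python
import numpy as np
from scipy.stats import norm, chisquare

rng = np.random.default_rng(0)
data = rng.normal(size=500)          # quantitative data whose distribution is in question
n, k = len(data), 8                  # sample size and number of bins (categories)

# Cut points at the 1/k, ..., (k-1)/k quantiles of the claimed N(0, 1),
# so that each of the k bins has theoretical probability p_j = 1/k
cuts = norm.ppf(np.arange(1, k) / k)

observed = np.bincount(np.searchsorted(cuts, data), minlength=k)  # O_j per bin
expected = np.full(k, n / k)                                      # E_j = n * p_j = n / k

# Goodness of fit test: H0 says the data follow the claimed distribution
statistic, p_value = chisquare(observed, f_exp=expected)
print(statistic, p_value)            # reject H0 at level alpha if p_value < alpha
```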
Example
Mendel’s Laws of Inheritance
According to Mendel’s laws of inheritance, for pea color, yellow is dominant over green, and for shape, round is dominant over wrinkled. When a purebred strain of yellow round peas is crossed with a purebred strain of green wrinkled peas, the phenotype of the first generation follows the dominant traits, yellow and round, while in the second generation yellow and green appear in a 3:1 ratio and round and wrinkled appear in a 3:1 ratio, giving an overall ratio of 9:3:3:1. [2]
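Concretely, since each trait splits 3:1 between dominant and recessive in the second generation and the two traits segregate independently, the four phenotype probabilities used below multiply out as $$ p_{\text{yellow round}} = {{3} \over {4}} \cdot {{3} \over {4}} = {{9} \over {16}} , \quad p_{\text{yellow wrinkled}} = {{3} \over {4}} \cdot {{1} \over {4}} = {{3} \over {16}} , \quad p_{\text{green round}} = {{1} \over {4}} \cdot {{3} \over {4}} = {{3} \over {16}} , \quad p_{\text{green wrinkled}} = {{1} \over {4}} \cdot {{1} \over {4}} = {{1} \over {16}} $$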
| | Yellow Round | Yellow Wrinkled | Green Round | Green Wrinkled |
|---|---|---|---|---|
| Observed Frequency | 773 | 231 | 238 | 59 |
Given the observed frequencies of the second-generation peas above [3], with total sample size $n = 1301$ and number of categories $k = 4$, let us test at significance level $\alpha = 0.05$ whether the experimental results conform to the theoretical 9:3:3:1 expectation.
- $H_{0}$: The result of the experiment conforms to Mendel’s laws of inheritance.
- $H_{1}$: The result of the experiment does not conform to Mendel’s laws of inheritance.
First, the null and alternative hypotheses are as above, and the Pearson Chi-square test statistic is calculated as: $$ \begin{align*} \mathcal{X}^{2} =& \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} \\ =& {{ \left( 773 - 9n / 16 \right)^{2} } \over { 9n / 16 }} + {{ \left( 231 - 3n / 16 \right)^{2} } \over { 3n / 16 }} + {{ \left( 238 - 3n / 16 \right)^{2} } \over { 3n / 16 }} + {{ \left( 59 - n / 16 \right)^{2} } \over { n / 16 }} \\ \approx& 9.27 \end{align*} $$ At significance level $\alpha = 0.05$ and $k - 1 = 3$ degrees of freedom, the critical value is $\chi_{1 - \alpha}^{2}(3) = 7.815$, and since $\chi_{1 - \alpha}^{2}(3) < \mathcal{X}^{2}$, the null hypothesis is rejected. In other words, at the significance level $\alpha = 0.05$, the experimental results do not conform to Mendel’s laws of inheritance.
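For reference, the same numbers can be checked programmatically; the sketch below feeds the observed frequencies from the table and the 9:3:3:1 expected frequencies into scipy.stats.chisquare, assuming scipy is available.

```python
from scipy.stats import chisquare, chi2

observed = [773, 231, 238, 59]            # Yellow Round, Yellow Wrinkled, Green Round, Green Wrinkled
p = [9 / 16, 3 / 16, 3 / 16, 1 / 16]      # Mendel's theoretical 9:3:3:1 probabilities
n = sum(observed)                         # n = 1301
expected = [n * pj for pj in p]           # E_j = n * p_j

statistic, p_value = chisquare(observed, f_exp=expected)
critical_value = chi2.ppf(0.95, df=len(observed) - 1)   # k - 1 = 3 degrees of freedom

# The statistic exceeds the critical value (equivalently, p_value < 0.05), so H0 is rejected
print(statistic, critical_value, p_value)
```

The printed statistic should match the hand calculation above, and the p-value coming in below $\alpha = 0.05$ is equivalent to the rejection of $H_{0}$.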
Looking at the raw results, yellow round to green wrinkled comes out at 773:59, roughly 13:1, which deviates noticeably from the theoretical 9:1. Compared to the other phenotypes there are too many yellow rounds and too few green wrinkleds, and the goodness of fit test confirms this statistically rather than by mere eyeballing.
It is important to note that the null hypothesis is rejected because $\mathcal{X}^{2}$ is large, which makes sense: the Pearson Chi-square statistic grows as the discrepancy between observed and expected frequencies grows. Formal talk of rejection regions and right-tailed tests can seem dry and difficult, but at least in the context of goodness of fit testing, it is fair to view $\mathcal{X}^{2}$ as the penalty points the experiment receives for deviating from the theory.
1. Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p597.
2. http://legacy.biotechlearn.org.nz/themes/mendel_and_inheritance/images/inheritance_of_multiple_traits_in_peas
3. 경북대학교 통계학과 (Department of Statistics, Kyungpook National University). (2008). 엑셀을 이용한 통계학 [Statistics Using Excel]: p269.