
Goodness-of-Fit Test

Hypothesis Testing [1]

Suppose categorical data are obtained from $n$ independent trials of a multinomial experiment in which each of $k$ categories is sampled with theoretical probability $p_{j} > 0$. The following hypothesis test, which uses the Pearson chi-square test statistic, is called a Goodness-of-Fit Test.

  • $H_{0}$: The given data were sampled according to the theoretical probabilities.
  • $H_{1}$: The given data were not sampled according to the theoretical probabilities.

Test Statistic

The test statistic is the Pearson chi-square statistic. $$ \mathcal{X}^{2} = \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} $$ Here, the frequency $O_{j}$ of data belonging to the $j$th category is called the observed frequency, and the product of the sample size and the probability $p_{j}$, namely $E_{j} := n p_{j}$, is called the expected frequency. Under $H_{0}$, this test statistic approximately follows a chi-square distribution with $(k - 1)$ degrees of freedom.
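The formula above translates directly into a few lines of Python. This is a minimal sketch using only the standard library; the function name `pearson_chi2` is our own, not from any particular package.

```python
from typing import Sequence

def pearson_chi2(observed: Sequence[int], probs: Sequence[float]) -> float:
    """Pearson chi-square statistic for a goodness-of-fit test.

    observed: observed frequencies O_j for each of the k categories
    probs:    theoretical probabilities p_j (must sum to 1)
    """
    n = sum(observed)                      # total sample size
    expected = [n * p for p in probs]      # E_j = n * p_j
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

If the observed frequencies match the expected ones exactly, the statistic is $0$; any discrepancy makes it strictly positive.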

Explanation

Goodness of Fit?

First, the term Goodness-of-Fit Test itself deserves a comment. In data science, the word fit is used somewhat differently from everyday speech. In Korean, the standard translation 적합 is not habitually used as a verb, so expressions like 'fit it' or 'after fitting' sound odd at first, even though fit works as a verb in English and such expressions come up constantly in this field. The gerund form fitting, as in 'do the fitting' or 'after the fitting', reads much more naturally.

So, what is this fitting? When the term is used concretely, it matches its everyday meaning almost exactly. For a fashion model, for instance, saying the fit is good means the clothes sit well on the model's frame. Likewise, in data science, the measure of how well a theoretical model matches our data is called Goodness of Fit.

In summary, the degree to which data matches the theory is referred to as goodness of fit, and Goodness of Fit Testing is a statistical test for this measure.

Applications

The applications of goodness of fit tests are endless, including distributions of blood types, MBTI, gender ratios, and sectors of employment. Although it appears that this test can only be used for categorical data since it presupposes a multinomial experiment, in reality, quantitative data can be converted into categorical data through classification, allowing its application to any theoretically defined distribution.
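The conversion of quantitative data into categories mentioned above can be sketched as follows. Everything here is a hypothetical illustration: the data are simulated, and the choice of quartile bins against a standard normal is ours.

```python
import random
from statistics import NormalDist

# Hypothetical example: test whether a quantitative sample is consistent
# with a standard normal distribution by first converting it to counts.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(400)]

std = NormalDist()                                    # theoretical N(0, 1)
edges = [std.inv_cdf(q) for q in (0.25, 0.5, 0.75)]   # quartile boundaries

# Classify each value into one of k = 4 bins -> categorical data.
observed = [0, 0, 0, 0]
for x in sample:
    observed[sum(x > e for e in edges)] += 1

# Each bin has theoretical probability p_j = 1/4, so E_j = n / 4.
n, k = len(sample), 4
chi2 = sum((o - n / k) ** 2 / (n / k) for o in observed)
```

Once the sample has been reduced to the four observed counts, the test proceeds exactly as in the categorical case, with $k - 1 = 3$ degrees of freedom.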

Example

Mendel’s Laws of Inheritance

(Figure: inheritance of multiple traits in Mendel's peas — mendel.jpg)

According to Mendel's laws of inheritance, yellow is dominant over green in pea color, and round is dominant over wrinkled in shape. When a purebred strain of yellow round peas is crossed with a purebred strain of green wrinkled peas, the first generation shows only the dominant phenotype, yellow and round, while in the second generation yellow and green appear in a 3:1 ratio and round and wrinkled appear in a 3:1 ratio, giving an overall ratio of 9:3:3:1. [2]

|                    | Yellow Round | Yellow Wrinkled | Green Round | Green Wrinkled |
|--------------------|--------------|-----------------|-------------|----------------|
| Observed Frequency | 773          | 231             | 238         | 59             |

Given that the observed frequencies of the second-generation peas are as above [3], with total sample size $n = 1301$ and number of categories $k = 4$, let's test whether the experiment's results conform to the theoretical expectation of 9:3:3:1 at significance level $\alpha = 0.05$.

  • $H_{0}$: The result of the experiment conforms to Mendel’s laws of inheritance.
  • $H_{1}$: The result of the experiment does not conform to Mendel’s laws of inheritance.

First, the null and alternative hypotheses are as above, and the Pearson chi-square test statistic is calculated as: $$ \begin{align*} \mathcal{X}^{2} =& \sum_{j=1}^{k} {{ \left( O_{j} - E_{j} \right)^{2} } \over { E_{j} }} \\ =& {{ \left( 773 - 9 n / 16 \right)^{2} } \over { 9 n / 16 }} + {{ \left( 231 - 3 n / 16 \right)^{2} } \over { 3 n / 16 }} + {{ \left( 238 - 3 n / 16 \right)^{2} } \over { 3 n / 16 }} + {{ \left( 59 - n / 16 \right)^{2} } \over { n / 16 }} \\ \approx& 9.27 \end{align*} $$ Given the significance level $\alpha = 0.05$ and degrees of freedom $k - 1 = 3$, the critical value of the right-tailed rejection region is $\chi_{1 - \alpha}^{2}(3) = 7.815$, and since $\chi_{1 - \alpha}^{2}(3) < \mathcal{X}^{2}$, the null hypothesis is rejected. In other words, at the significance level $\alpha = 0.05$, the experimental results do not conform to Mendel's laws of inheritance.
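The calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the critical value is taken from the text rather than computed.

```python
# Mendel pea example: observed counts and the 9:3:3:1 theoretical ratio.
observed = [773, 231, 238, 59]    # yellow round, yellow wrinkled, green round, green wrinkled
probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

n = sum(observed)                 # total sample size
expected = [n * p for p in probs]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 7.815                  # chi-square 0.95 quantile with 3 degrees of freedom
reject = chi2 > critical          # True: the data deviate from 9:3:3:1
```

Since `chi2` exceeds the critical value, the null hypothesis is rejected, matching the conclusion above.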

Looking at why the results are anomalous: yellow round and green wrinkled occur in a ratio of 773:59, which deviates considerably from the theoretical 9:1. Relative to the other phenotypes, there are too many yellow rounds and too few green wrinkleds, and the goodness-of-fit test confirms this statistically rather than by mere eyeballing.

It's important to note that the rejection is driven by $\mathcal{X}^{2}$ being large, which makes sense because the Pearson chi-square statistic grows as the discrepancy between observed and expected frequencies grows. Talk of rejection regions and right-tailed tests can seem difficult and dry without the equations, but at least in the context of goodness-of-fit testing, it's fair to view $\mathcal{X}^{2}$ as the penalty points the experiment receives for deviating from the theory.


  1. Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p597. ↩︎

  2. http://legacy.biotechlearn.org.nz/themes/mendel_and_inheritance/images/inheritance_of_multiple_traits_in_peas ↩︎

  3. Kyungpook National University, Department of Statistics. (2008). Statistics Using Excel (엑셀을 이용한 통계학): p269. ↩︎