F-test in Analysis of Variance
Hypothesis Testing 1
Assume that an experimental design has $k$ treatments, and that treatment $j$ contains $n_{j}$ samples, for a total of $n = n_{1} + \cdots + n_{k}$ samples. Assume that each sample in treatment $j = 1 , \cdots , k$ is drawn independently and at random from a normal distribution $N \left( \mu_{j} , \sigma_{j}^{2} \right)$, and that the population variances are equal, i.e. $\sigma^{2} = \sigma_{1}^{2} = \cdots = \sigma_{k}^{2}$. In an analysis of variance designed to compare the population means, the hypothesis test is set up as follows:
- $H_{0}$: $\mu_{1} = \cdots = \mu_{k}$
- $H_{1}$: At least one $\mu_{j}$ is different from the others.
Test Statistic
Assume we have the ANOVA table as follows:
Source | df | SS | MS | F |
---|---|---|---|---|
Treatments | $k-1$ | SST | MST | MST/MSE |
Error | $n-k$ | SSE | MSE | |
Total | $n-1$ | TSS | | |
The test statistic is as follows: $$ F = {\frac{ \text{MST} }{ \text{MSE} }} = {\frac{ \text{SST} / (k - 1) }{ \text{SSE} / (n - k) }} $$ Under the null hypothesis, this test statistic follows the F-distribution $F \left( k - 1 , n - k \right)$ with $(k-1)$ and $(n-k)$ degrees of freedom.
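As a quick numerical illustration, here is a minimal Python sketch (the three treatment groups below are hypothetical values chosen only for demonstration) that computes MST, MSE, and $F$ directly from the formulas above and compares the result with `scipy.stats.f_oneway`.

```python
# A minimal sketch with hypothetical data: three treatments, unequal group sizes.
import numpy as np
from scipy import stats

groups = [
    np.array([4.2, 5.1, 4.8, 5.5]),        # treatment 1
    np.array([6.0, 5.7, 6.3, 5.9, 6.1]),   # treatment 2
    np.array([4.9, 5.3, 5.0, 5.2]),        # treatment 3
]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-treatment and within-treatment sums of squares
SST = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)

MST = SST / (k - 1)
MSE = SSE / (n - k)
F = MST / MSE
p = stats.f.sf(F, k - 1, n - k)          # right-tail probability of F(k-1, n-k)

print(F, p)
print(stats.f_oneway(*groups))           # should agree with the manual computation
```

The manually computed statistic and p-value should match the `f_oneway` output up to floating-point error.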
Explanation
Whether it’s a one-way ANOVA or a two-way ANOVA, the mathematical derivations are very similar, differing only in the presence of blocks. For convenience, in this post, I’ll focus only on the theoretical background of one-way ANOVA under a completely randomized design.
Let’s explore how the quantities in the ANOVA table lead to the hypothesis test by deriving the test statistic. The derivation requires a fair amount of background in linear algebra and mathematical statistics, so undergraduates may skip it, but graduate students are encouraged to work through it.
Derivation 2
Part 1. $\sum Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3}$
Let the treatment mean be $\bar{x}_{j} := \sum_{i} x_{ij} / n_{j}$, and the overall mean be $\bar{x} := \sum_{ij} x_{ij} / n$. $$ \begin{align*} \text{SST} =& \sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right)^{2} \\ \text{SSE} =& \left( n_{1} - 1 \right) s_{1}^{2} + \cdots + \left( n_{k} - 1 \right) s_{k}^{2} \\ \text{MST} =& {\frac{ \text{SST} }{ k - 1 }} \\ \text{MSE} =& {\frac{ \text{SSE} }{ n - k }} \\ F =& {\frac{ \text{MST} }{ \text{MSE} }} = {\frac{ \text{SST}/ (k - 1) }{ \text{SSE} / (n - k) }} \end{align*} $$
The ANOVA table for one-way ANOVA appears as above. Assuming the null hypothesis is true, for some $\mu$ we can set $\mu = \mu_{1} = \cdots = \mu_{k}$ and define the z-score $Z_{ij}$ as follows: $$ Z_{ij} := {\frac{ x_{ij} - \mu }{ \sigma }} \sim N \left( 0, 1 \right) $$ The sum of squares of the $Z_{ij}$ can be expanded as follows: $$ \begin{align*} & \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} \\ =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \mu }{ \sigma }} \right)^{2} \\ =& {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right) + \left( \bar{x}_{j} - \bar{x} \right) + \left( \bar{x} - \mu \right) \right]^{2} \\ =& {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right)^{2} + \left( \bar{x}_{j} - \bar{x} \right)^{2} + \left( \bar{x} - \mu \right)^{2} \right] \\ &+ {\frac{ 2 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right) \left( \bar{x}_{j} - \bar{x} \right) + \left( \bar{x}_{j} - \bar{x} \right) \left( \bar{x} - \mu \right) + \left( \bar{x} - \mu \right) \left( x_{ij} - \bar{x}_{j} \right) \right] \end{align*} $$ In the last line, the first and third cross terms vanish because $$ \begin{align*} & \sum_{i=1}^{n_{j}} \left( x_{ij} - \bar{x}_{j} \right) \\ =& \sum_{i=1}^{n_{j}} x_{ij} - n_{j} {\frac{ 1 }{ n_{j} }} \sum_{i=1}^{n_{j}} x_{ij} \\ =& 0 \end{align*} $$ and the second cross term vanishes because $\sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right) = \sum_{j=1}^{k} n_{j} \bar{x}_{j} - n \bar{x} = 0$. Thus all cross terms reduce to $0$, and the sum of squares of the $Z_{ij}$ can be expressed as: $$ \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \bar{x}_{j} }{ \sigma }} \right)^{2} + \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x}_{j} - \bar{x} }{ \sigma }} \right)^{2} + \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x} - \mu }{ \sigma }} \right)^{2} $$ Now, label the three sums on the right-hand side as $Q_{1}, Q_{2}, Q_{3}$ in order. $$ \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3} $$
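To see this decomposition numerically, here is a minimal sketch that simulates hypothetical data under $H_{0}$ (the common $\mu$ and $\sigma$ are assumed values used only for illustration, since they are unknown in practice) and checks that $\sum_{ij} Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3}$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 1.0                     # assumed common mean and s.d. under H0
n_sizes = [4, 5, 4]                      # hypothetical group sizes
groups = [rng.normal(mu, sigma, size=nj) for nj in n_sizes]

x = np.concatenate(groups)
n = x.size
grand_mean = x.mean()

Z2 = ((x - mu) / sigma) ** 2             # squared z-scores

Q1 = sum((((g - g.mean()) / sigma) ** 2).sum() for g in groups)            # within treatments
Q2 = sum(len(g) * ((g.mean() - grand_mean) / sigma) ** 2 for g in groups)  # between treatments
Q3 = n * ((grand_mean - mu) / sigma) ** 2                                  # grand mean vs mu

print(Z2.sum(), Q1 + Q2 + Q3)            # the two numbers should coincide
```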
Part 2. Cochran’s Theorem
Sum of squared deviations expressed in a quadratic form of a random vector: For a random vector $\mathbf{X} = \left( X_{1} , \cdots , X_{n} \right)$, an identity matrix $I_{n} \in \mathbb{R}^{n \times n}$, and an all-ones matrix $J_{n} \in \mathbb{R}^{n \times n}$ whose elements are all $1$, the following holds: $$ \mathbf{X}^{T} \left( I_{n} - {\frac{ 1 }{ n }} J_{n} \right) \mathbf{X} = ( n - 1 ) S^{2} $$ Here, $S^{2}$ represents the sample variance.
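This quadratic-form identity can be checked directly; the sketch below uses an arbitrary random vector, purely for illustration, to confirm that $\mathbf{X}^{T} \left( I_{n} - {\frac{1}{n}} J_{n} \right) \mathbf{X} = (n-1) S^{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
X = rng.normal(size=n)

center = np.eye(n) - np.ones((n, n)) / n   # I_n - (1/n) J_n, the centering matrix
lhs = X @ center @ X
rhs = (n - 1) * X.var(ddof=1)              # (n - 1) S^2 with the sample variance

print(lhs, rhs)                            # the two values should agree
```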
Let’s define three symmetric matrices $A_{1} , A_{2} , A_{3}$ using the $n \times n$ identity matrix $I_{n}$, the all-ones matrix $J_{n}$, and the block diagonal matrix $\diag ( \cdot )$ whose blocks are built from the all-ones matrices $J_{n_{j}}$, as follows: $$ \begin{align*} A_{1} :=& I_{n} - \diag \left( {\frac{ 1 }{ n_{1} }} J_{n_{1}} , \cdots , {\frac{ 1 }{ n_{k} }} J_{n_{k}} \right) \\ A_{2} :=& \diag \left( {\frac{ 1 }{ n_{1} }} J_{n_{1}} , \cdots , {\frac{ 1 }{ n_{k} }} J_{n_{k}} \right) - {\frac{ 1 }{ n }} J_{n} \\ A_{3} :=& {\frac{ 1 }{ n }} J_{n} \end{align*} $$ Note that each block of the block diagonal matrix sets the stage for the $\left( n_{j} - 1 \right) s_{j}^{2}$ of each treatment $j$. From their definitions, the sum of these three matrices is $A_{1} + A_{2} + A_{3} = I_{n}$, and since each $A_{l}$ is a symmetric idempotent matrix (so its rank equals its trace) and each scaled all-ones block ${\frac{ 1 }{ m }} J_{m}$ has trace $1$, it is not hard to see that the ranks of these three matrices are: $$ \begin{align*} \rank A_{1} =& n - k \\ \rank A_{2} =& k - 1 \\ \rank A_{3} =& 1 \end{align*} $$
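The sketch below (reusing the same hypothetical group sizes) builds $A_{1}, A_{2}, A_{3}$ with `scipy.linalg.block_diag` and verifies that they sum to $I_{n}$ and have ranks $n-k$, $k-1$, and $1$.

```python
import numpy as np
from scipy.linalg import block_diag

n_sizes = [4, 5, 4]                       # hypothetical n_1, ..., n_k
k, n = len(n_sizes), sum(n_sizes)

P = block_diag(*[np.ones((nj, nj)) / nj for nj in n_sizes])   # diag((1/n_j) J_{n_j})
J = np.ones((n, n))

A1 = np.eye(n) - P
A2 = P - J / n
A3 = J / n

print(np.allclose(A1 + A2 + A3, np.eye(n)))                    # True: they sum to I_n
print([np.linalg.matrix_rank(A) for A in (A1, A2, A3)],        # ranks of A1, A2, A3
      [n - k, k - 1, 1])                                       # should match n-k, k-1, 1
```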
Let’s define the vector $\mathbf{Z} \in \mathbb{R}^{n \times 1}$ by stacking the treatment-wise z-score vectors $\mathbf{z}_{j} := \left( Z_{1j} , \cdots , Z_{n_{j} j} \right) \in \mathbb{R}^{n_{j} \times 1}$ as follows: $$ \mathbf{Z} := \begin{bmatrix} \mathbf{z}_{1} \\ \vdots \\ \mathbf{z}_{k} \end{bmatrix} = \begin{bmatrix} \left( Z_{11} , \cdots , Z_{n_{1} 1} \right) \\ \vdots \\ \left( Z_{1k} , \cdots , Z_{n_{k} k} \right) \end{bmatrix} = \begin{bmatrix} Z_{11} \\ \vdots \\ Z_{n_{k} k} \end{bmatrix} $$ In this vector notation, the sum of squares of the $Z_{ij}$ can be represented as follows: $$ \begin{align*} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} =& Q_{1} + Q_{2} + Q_{3} \\ =& \mathbf{Z}^{T} A_{1} \mathbf{Z} + \mathbf{Z}^{T} A_{2} \mathbf{Z} + \mathbf{Z}^{T} A_{3} \mathbf{Z} \end{align*} $$
Cochran’s Theorem: Let the sample $\mathbf{X} = \left( X_{1} , \cdots , X_{n} \right)$ follow an iid normal distribution, $X_{1} , \cdots , X_{n} \overset{\text{iid}}{\sim} N \left( 0, \sigma^{2} \right)$. Given symmetric matrices $A_{1} , \cdots , A_{k} \in \mathbb{R}^{n \times n}$ with ranks $r_{j}$, if the random variables $Q_{1} , \cdots , Q_{k}$ are expressed as the quadratic forms $Q_{j} := \mathbf{X}^{T} A_{j} \mathbf{X}$ and the sum of squares of the sample satisfies $\sum_{i=1}^{n} X_{i}^{2} = \sum_{j=1}^{k} Q_{j}$, then the following holds: $$ \forall j , {\frac{ Q_{j} }{ \sigma^{2} }} \sim \chi^{2} \left( r_{j} \right) \land \forall j_{1} \ne j_{2} , Q_{j_{1}} \perp Q_{j_{2}} \iff \sum_{j=1}^{k} r_{j} = n $$ In other words, the $Q_{j}$ being independent and following the chi-squared distributions $\chi^{2} \left( r_{j} \right)$ is equivalent to the sum of the ranks $r_{j}$ being equal to the sample size $n$.
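A small Monte Carlo sketch (with the same hypothetical group sizes and many simulated standard-normal samples) illustrates what Cochran's theorem asserts here: $Q_{1}$ and $Q_{2}$ behave like independent $\chi^{2}(n-k)$ and $\chi^{2}(k-1)$ variables.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
n_sizes = [4, 5, 4]
k, n = len(n_sizes), sum(n_sizes)

P = block_diag(*[np.ones((nj, nj)) / nj for nj in n_sizes])
A1, A2 = np.eye(n) - P, P - np.ones((n, n)) / n

Z = rng.standard_normal((10_000, n))       # many samples of Z ~ N(0, I_n)
Q1 = np.einsum("si,ij,sj->s", Z, A1, Z)    # Z^T A_1 Z for every sample
Q2 = np.einsum("si,ij,sj->s", Z, A2, Z)    # Z^T A_2 Z for every sample

print(Q1.mean(), n - k)                    # E[chi^2(n-k)] = n - k
print(Q2.mean(), k - 1)                    # E[chi^2(k-1)] = k - 1
print(np.corrcoef(Q1, Q2)[0, 1])           # near 0, consistent with independence
```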
Each component of $\mathbf{Z}$ is independently observed from the standard normal distribution $N \left( 0 , 1^{2} \right)$, and $\sum_{l=1}^{3} \rank A_{l} = n$, so according to Cochran's theorem, $Q_{1}$ and $Q_{2}$ independently follow chi-squared distributions as follows: $$ \begin{align*} Q_{1} = {\frac{ Q_{1} }{ 1^{2} }} \sim& \chi^{2} \left( \rank A_{1} \right) = \chi^{2} \left( n - k \right) \\ Q_{2} = {\frac{ Q_{2} }{ 1^{2} }} \sim& \chi^{2} \left( \rank A_{2} \right) = \chi^{2} \left( k - 1 \right) \end{align*} $$ Meanwhile, $Q_{1}$ and $Q_{2}$ can be expressed in terms of $\text{SSE}$ and $\text{SST}$ as follows: $$ \begin{align*} Q_{1} =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \bar{x}_{j} }{ \sigma }} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \left[ \left( n_{1} - 1 \right) s_{1}^{2} + \cdots + \left( n_{k} - 1 \right) s_{k}^{2} \right] = {\frac{ 1 }{ \sigma^{2} }} \text{SSE} \\ Q_{2} =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x}_{j} - \bar{x} }{ \sigma }} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \text{SST} \end{align*} $$
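The sketch below (again with hypothetical simulated data and an assumed common $\mu$ and $\sigma$ chosen only for illustration) confirms numerically that $\mathbf{Z}^{T} A_{1} \mathbf{Z} = \text{SSE} / \sigma^{2}$ and $\mathbf{Z}^{T} A_{2} \mathbf{Z} = \text{SST} / \sigma^{2}$.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
mu, sigma = 5.0, 1.0                       # assumed common mean and s.d. under H0
n_sizes = [4, 5, 4]
k, n = len(n_sizes), sum(n_sizes)
groups = [rng.normal(mu, sigma, size=nj) for nj in n_sizes]
x = np.concatenate(groups)

Zvec = (x - mu) / sigma                    # the stacked vector Z of z-scores

P = block_diag(*[np.ones((nj, nj)) / nj for nj in n_sizes])
A1, A2 = np.eye(n) - P, P - np.ones((n, n)) / n

SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)
SST = sum(len(g) * (g.mean() - x.mean()) ** 2 for g in groups)

print(Zvec @ A1 @ Zvec, SSE / sigma**2)    # Q1 = SSE / sigma^2
print(Zvec @ A2 @ Zvec, SST / sigma**2)    # Q2 = SST / sigma^2
```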
Part 3. Deriving F-distribution
Deriving F-distribution from two Independent Chi-squared distributions: Suppose two random variables, $U,V$, are independent, and assume $U \sim \chi^{2} ( r_{1})$, $V \sim \chi^{2} ( r_{2})$, then $$ {{ U / r_{1} } \over { V / r_{2} }} \sim F \left( r_{1} , r_{2} \right) $$
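This fact can also be checked by simulation; the sketch below (with degrees of freedom $r_{1} = 2$, $r_{2} = 10$ chosen arbitrarily for illustration) compares the empirical quantiles of $(U / r_{1}) / (V / r_{2})$ with the theoretical quantiles of $F(r_{1}, r_{2})$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
r1, r2 = 2, 10                            # arbitrary degrees of freedom
U = rng.chisquare(r1, size=100_000)       # independent chi-squared samples
V = rng.chisquare(r2, size=100_000)
ratio = (U / r1) / (V / r2)

# Compare empirical quantiles with the theoretical F(r1, r2) quantiles
qs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(ratio, qs))
print(stats.f.ppf(qs, r1, r2))            # the two rows should be close
```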
$$ \begin{align*} F =& {\frac{ \text{MST} }{ \text{MSE} }} \\ =& {\frac{ \text{SST} / (k - 1) }{ \text{SSE} / (n - k) }} \\ =& {\frac{ \left( \text{SST} / \sigma^{2} \right) / (k - 1) }{ \left( \text{SSE} / \sigma^{2 }\right) / (n - k) }} \\ =& {\frac{ Q_{2} / (k-1)}{ Q_{1} / (n-k) }} \\ \sim& F \left( k - 1 , n - k \right) \end{align*} $$ Hence, it is confirmed that, under the assumption that the null hypothesis is true, the test statistic $F$ follows the F-distribution $F \left( k - 1 , n - k \right)$.
■
Further Reading
Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p455.