F-test in Analysis of Variance
Hypothesis Testing 1
Assume that an experimental design has $k$ treatments, and that treatment $j$ contains $n_{j}$ samples, for a total of $n = n_{1} + \cdots + n_{k}$ samples. Assume that each sample in treatment $j = 1 , \cdots , k$ is drawn independently and at random from a normal distribution $N \left( \mu_{j} , \sigma_{j}^{2} \right)$, and that the population variances are equal, i.e. $\sigma^{2} = \sigma_{1}^{2} = \cdots = \sigma_{k}^{2}$. In an analysis of variance designed to compare the population means, the hypothesis test is set up as follows:
- $H_{0}$: $\mu_{1} = \cdots = \mu_{k}$
- $H_{1}$: At least one $\mu_{j}$ is different from the others.
Test Statistic
Assume we have the ANOVA table as follows:
Source | df | SS | MS | F |
---|---|---|---|---|
Treatments | $k-1$ | SST | MST | MST/MSE |
Error | $n-k$ | SSE | MSE | |
Total | $n-1$ | TSS | | |
The test statistic is as follows: $$ F = {\frac{ \text{MST} }{ \text{MSE} }} = {\frac{ \text{SST} / (k - 1) }{ \text{SSE} / (n - k) }} $$ Under the null hypothesis, this test statistic follows the F-distribution $F \left( k - 1 , n - k \right)$ with $(k-1)$ and $(n-k)$ degrees of freedom.
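As a quick numerical illustration, here is a minimal Python sketch (the three treatment groups below are hypothetical values chosen only for demonstration) that computes MST, MSE, and $F$ directly from the formulas above and compares the result with `scipy.stats.f_oneway`.

```python
# A minimal sketch with hypothetical data: three treatments, unequal group sizes.
import numpy as np
from scipy import stats

groups = [
    np.array([4.2, 5.1, 4.8, 5.5]),        # treatment 1
    np.array([6.0, 5.7, 6.3, 5.9, 6.1]),   # treatment 2
    np.array([4.9, 5.3, 5.0, 5.2]),        # treatment 3
]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-treatment and within-treatment sums of squares
SST = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)

MST = SST / (k - 1)
MSE = SSE / (n - k)
F = MST / MSE
p = stats.f.sf(F, k - 1, n - k)          # right-tail probability of F(k-1, n-k)

print(F, p)
print(stats.f_oneway(*groups))           # should agree with the manual computation
```

The manually computed statistic and p-value should match the `f_oneway` output up to floating-point error.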
Explanation
Whether it’s a one-way ANOVA or a two-way ANOVA, the mathematical derivations are very similar, differing only in the presence of blocks. For convenience, in this post, I’ll focus only on the theoretical background of one-way ANOVA under a completely randomized design.
Let’s explore how the quantities in the ANOVA table lead to the hypothesis test by deriving the test statistic. The derivation requires a fair amount of background in linear algebra and mathematical statistics, so undergraduates may skip it, but graduate students are encouraged to work through it.
Derivation 2
Part 1. $\sum Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3}$
Let the treatment mean be $\bar{x}_{j} := \sum_{i} x_{ij} / n_{j}$, and the overall mean be $\bar{x} := \sum_{ij} x_{ij} / n$. $$ \begin{align*} \text{SST} =& \sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right)^{2} \\ \text{SSE} =& \left( n_{1} - 1 \right) s_{1}^{2} + \cdots + \left( n_{k} - 1 \right) s_{k}^{2} \\ \text{MST} =& {\frac{ \text{SST} }{ k - 1 }} \\ \text{MSE} =& {\frac{ \text{SSE} }{ n - k }} \\ F =& {\frac{ \text{MST} }{ \text{MSE} }} = {\frac{ \text{SST}/ (k - 1) }{ \text{SSE} / (n - k) }} \end{align*} $$
The ANOVA table for one-way ANOVA appears as above. Assuming the null hypothesis is true, for some $\mu$ we can set $\mu = \mu_{1} = \cdots = \mu_{k}$ and define the z-score $Z_{ij}$ as follows: $$ Z_{ij} := {\frac{ x_{ij} - \mu }{ \sigma }} \sim N \left( 0, 1 \right) $$ The sum of squares of the $Z_{ij}$ can be expanded as follows: $$ \begin{align*} & \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} \\ =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \mu }{ \sigma }} \right)^{2} \\ =& {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right) + \left( \bar{x}_{j} - \bar{x} \right) + \left( \bar{x} - \mu \right) \right]^{2} \\ =& {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right)^{2} + \left( \bar{x}_{j} - \bar{x} \right)^{2} + \left( \bar{x} - \mu \right)^{2} \right] \\ &+ {\frac{ 2 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right) \left( \bar{x}_{j} - \bar{x} \right) + \left( \bar{x}_{j} - \bar{x} \right) \left( \bar{x} - \mu \right) + \left( \bar{x} - \mu \right) \left( x_{ij} - \bar{x}_{j} \right) \right] \end{align*} $$ In the last line, the first and third cross terms vanish because $$ \begin{align*} & \sum_{i=1}^{n_{j}} \left( x_{ij} - \bar{x}_{j} \right) \\ =& \sum_{i=1}^{n_{j}} x_{ij} - n_{j} {\frac{ 1 }{ n_{j} }} \sum_{i=1}^{n_{j}} x_{ij} \\ =& 0 \end{align*} $$ and the second cross term vanishes because $\sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right) = \sum_{j=1}^{k} n_{j} \bar{x}_{j} - n \bar{x} = 0$. Thus all cross terms reduce to $0$, and the sum of squares of the $Z_{ij}$ can be expressed as: $$ \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \bar{x}_{j} }{ \sigma }} \right)^{2} + \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x}_{j} - \bar{x} }{ \sigma }} \right)^{2} + \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x} - \mu }{ \sigma }} \right)^{2} $$ Now, label the three sums on the right-hand side as $Q_{1}, Q_{2}, Q_{3}$ in order. $$ \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3} $$
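To see this decomposition numerically, here is a minimal sketch that simulates hypothetical data under $H_{0}$ (the common $\mu$ and $\sigma$ are assumed values used only for illustration, since they are unknown in practice) and checks that $\sum_{ij} Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3}$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 1.0                     # assumed common mean and s.d. under H0
n_sizes = [4, 5, 4]                      # hypothetical group sizes
groups = [rng.normal(mu, sigma, size=nj) for nj in n_sizes]

x = np.concatenate(groups)
n = x.size
grand_mean = x.mean()

Z2 = ((x - mu) / sigma) ** 2             # squared z-scores

Q1 = sum((((g - g.mean()) / sigma) ** 2).sum() for g in groups)            # within treatments
Q2 = sum(len(g) * ((g.mean() - grand_mean) / sigma) ** 2 for g in groups)  # between treatments
Q3 = n * ((grand_mean - mu) / sigma) ** 2                                  # grand mean vs mu

print(Z2.sum(), Q1 + Q2 + Q3)            # the two numbers should coincide
```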
Part 2. Cochran’s Theorem
Sum of squared deviations expressed in a quadratic form of a random vector: For a random vector $\mathbf{X} = \left( X_{1} , \cdots , X_{n} \right)$, an identity matrix $I_{n} \in \mathbb{R}^{n \times n}$, and an all-ones matrix $J_{n} \in \mathbb{R}^{n \times n}$ whose elements are all $1$, the following holds: $$ \mathbf{X}^{T} \left( I_{n} - {\frac{ 1 }{ n }} J_{n} \right) \mathbf{X} = ( n - 1 ) S^{2} $$ Here, $S^{2}$ represents the sample variance.
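This quadratic-form identity can be checked directly; the sketch below uses an arbitrary random vector, purely for illustration, to confirm that $\mathbf{X}^{T} \left( I_{n} - {\frac{1}{n}} J_{n} \right) \mathbf{X} = (n-1) S^{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
X = rng.normal(size=n)

center = np.eye(n) - np.ones((n, n)) / n   # I_n - (1/n) J_n, the centering matrix
lhs = X @ center @ X
rhs = (n - 1) * X.var(ddof=1)              # (n - 1) S^2 with the sample variance

print(lhs, rhs)                            # the two values should agree
```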
Let’s define three symmetric matrices $A_{1} , A_{2} , A_{3}$ using the $n \times n$ identity matrix $I_{n}$, the all-ones matrix $J_{n}$, and the block diagonal matrix $\diag ( \cdot )$ whose blocks are built from the all-ones matrices $J_{n_{j}}$, as follows: $$ \begin{align*} A_{1} :=& I_{n} - \diag \left( {\frac{ 1 }{ n_{1} }} J_{n_{1}} , \cdots , {\frac{ 1 }{ n_{k} }} J_{n_{k}} \right) \\ A_{2} :=& \diag \left( {\frac{ 1 }{ n_{1} }} J_{n_{1}} , \cdots , {\frac{ 1 }{ n_{k} }} J_{n_{k}} \right) - {\frac{ 1 }{ n }} J_{n} \\ A_{3} :=& {\frac{ 1 }{ n }} J_{n} \end{align*} $$ Note that each block of the block diagonal matrix sets the stage for the $\left( n_{j} - 1 \right) s_{j}^{2}$ of each treatment $j$. From their definitions, the sum of these three matrices is $A_{1} + A_{2} + A_{3} = I_{n}$, and since each $A_{l}$ is a symmetric idempotent matrix (so its rank equals its trace) and each scaled all-ones block ${\frac{ 1 }{ m }} J_{m}$ has trace $1$, it is not hard to see that the ranks of these three matrices are: $$ \begin{align*} \rank A_{1} =& n - k \\ \rank A_{2} =& k - 1 \\ \rank A_{3} =& 1 \end{align*} $$
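The sketch below (reusing the same hypothetical group sizes) builds $A_{1}, A_{2}, A_{3}$ with `scipy.linalg.block_diag` and verifies that they sum to $I_{n}$ and have ranks $n-k$, $k-1$, and $1$.

```python
import numpy as np
from scipy.linalg import block_diag

n_sizes = [4, 5, 4]                       # hypothetical n_1, ..., n_k
k, n = len(n_sizes), sum(n_sizes)

P = block_diag(*[np.ones((nj, nj)) / nj for nj in n_sizes])   # diag((1/n_j) J_{n_j})
J = np.ones((n, n))

A1 = np.eye(n) - P
A2 = P - J / n
A3 = J / n

print(np.allclose(A1 + A2 + A3, np.eye(n)))                    # True: they sum to I_n
print([np.linalg.matrix_rank(A) for A in (A1, A2, A3)],        # ranks of A1, A2, A3
      [n - k, k - 1, 1])                                       # should match n-k, k-1, 1
```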
Let’s define the vector $\mathbf{Z} \in \mathbb{R}^{n \times 1}$ by stacking the treatment-wise z-score vectors $\mathbf{z}_{j} := \left( Z_{1j} , \cdots , Z_{n_{j} j} \right) \in \mathbb{R}^{n_{j} \times 1}$ as follows: $$ \mathbf{Z} := \begin{bmatrix} \mathbf{z}_{1} \\ \vdots \\ \mathbf{z}_{k} \end{bmatrix} = \begin{bmatrix} \left( Z_{11} , \cdots , Z_{n_{1} 1} \right) \\ \vdots \\ \left( Z_{1k} , \cdots , Z_{n_{k} k} \right) \end{bmatrix} = \begin{bmatrix} Z_{11} \\ \vdots \\ Z_{n_{k} k} \end{bmatrix} $$ In this vector notation, the sum of squares of the $Z_{ij}$ can be represented as follows: $$ \begin{align*} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} =& Q_{1} + Q_{2} + Q_{3} \\ =& \mathbf{Z}^{T} A_{1} \mathbf{Z} + \mathbf{Z}^{T} A_{2} \mathbf{Z} + \mathbf{Z}^{T} A_{3} \mathbf{Z} \end{align*} $$
Cochran’s Theorem: Let the sample $\mathbf{X} = \left( X_{1} , \cdots , X_{n} \right)$ follow an iid normal distribution, $X_{1} , \cdots , X_{n} \overset{\text{iid}}{\sim} N \left( 0, \sigma^{2} \right)$. Given symmetric matrices $A_{1} , \cdots , A_{k} \in \mathbb{R}^{n \times n}$ with ranks $r_{j}$, if the random variables $Q_{1} , \cdots , Q_{k}$ are expressed as the quadratic forms $Q_{j} := \mathbf{X}^{T} A_{j} \mathbf{X}$ and the sum of squares of the sample satisfies $\sum_{i=1}^{n} X_{i}^{2} = \sum_{j=1}^{k} Q_{j}$, then the following holds: $$ \forall j , {\frac{ Q_{j} }{ \sigma^{2} }} \sim \chi^{2} \left( r_{j} \right) \land \forall j_{1} \ne j_{2} , Q_{j_{1}} \perp Q_{j_{2}} \iff \sum_{j=1}^{k} r_{j} = n $$ In other words, the $Q_{j}$ being independent and following the chi-squared distributions $\chi^{2} \left( r_{j} \right)$ is equivalent to the sum of the ranks $r_{j}$ being equal to the sample size $n$.
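A small Monte Carlo sketch (with the same hypothetical group sizes and many simulated standard-normal samples) illustrates what Cochran's theorem asserts here: $Q_{1}$ and $Q_{2}$ behave like independent $\chi^{2}(n-k)$ and $\chi^{2}(k-1)$ variables.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
n_sizes = [4, 5, 4]
k, n = len(n_sizes), sum(n_sizes)

P = block_diag(*[np.ones((nj, nj)) / nj for nj in n_sizes])
A1, A2 = np.eye(n) - P, P - np.ones((n, n)) / n

Z = rng.standard_normal((10_000, n))       # many samples of Z ~ N(0, I_n)
Q1 = np.einsum("si,ij,sj->s", Z, A1, Z)    # Z^T A_1 Z for every sample
Q2 = np.einsum("si,ij,sj->s", Z, A2, Z)    # Z^T A_2 Z for every sample

print(Q1.mean(), n - k)                    # E[chi^2(n-k)] = n - k
print(Q2.mean(), k - 1)                    # E[chi^2(k-1)] = k - 1
print(np.corrcoef(Q1, Q2)[0, 1])           # near 0, consistent with independence
```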
Each component of $\mathbf{Z}$ is independently observed from the standard normal distribution $N \left( 0 , 1^{2} \right)$, and $\sum_{l=1}^{3} \rank A_{l} = n$, so according to Cochran's theorem, $Q_{1}$ and $Q_{2}$ independently follow chi-squared distributions as follows: $$ \begin{align*} Q_{1} = {\frac{ Q_{1} }{ 1^{2} }} \sim& \chi^{2} \left( \rank A_{1} \right) = \chi^{2} \left( n - k \right) \\ Q_{2} = {\frac{ Q_{2} }{ 1^{2} }} \sim& \chi^{2} \left( \rank A_{2} \right) = \chi^{2} \left( k - 1 \right) \end{align*} $$ Meanwhile, $Q_{1}$ and $Q_{2}$ can be expressed in terms of $\text{SSE}$ and $\text{SST}$ as follows: $$ \begin{align*} Q_{1} =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \bar{x}_{j} }{ \sigma }} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \left[ \left( n_{1} - 1 \right) s_{1}^{2} + \cdots + \left( n_{k} - 1 \right) s_{k}^{2} \right] = {\frac{ 1 }{ \sigma^{2} }} \text{SSE} \\ Q_{2} =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x}_{j} - \bar{x} }{ \sigma }} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \text{SST} \end{align*} $$
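The sketch below (again with hypothetical simulated data and an assumed common $\mu$ and $\sigma$ chosen only for illustration) confirms numerically that $\mathbf{Z}^{T} A_{1} \mathbf{Z} = \text{SSE} / \sigma^{2}$ and $\mathbf{Z}^{T} A_{2} \mathbf{Z} = \text{SST} / \sigma^{2}$.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
mu, sigma = 5.0, 1.0                       # assumed common mean and s.d. under H0
n_sizes = [4, 5, 4]
k, n = len(n_sizes), sum(n_sizes)
groups = [rng.normal(mu, sigma, size=nj) for nj in n_sizes]
x = np.concatenate(groups)

Zvec = (x - mu) / sigma                    # the stacked vector Z of z-scores

P = block_diag(*[np.ones((nj, nj)) / nj for nj in n_sizes])
A1, A2 = np.eye(n) - P, P - np.ones((n, n)) / n

SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)
SST = sum(len(g) * (g.mean() - x.mean()) ** 2 for g in groups)

print(Zvec @ A1 @ Zvec, SSE / sigma**2)    # Q1 = SSE / sigma^2
print(Zvec @ A2 @ Zvec, SST / sigma**2)    # Q2 = SST / sigma^2
```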
Part 3. Deriving F-distribution
Deriving F-distribution from two Independent Chi-squared distributions: Suppose two random variables, $U,V$, are independent, and assume $U \sim \chi^{2} ( r_{1})$, $V \sim \chi^{2} ( r_{2})$, then $$ {{ U / r_{1} } \over { V / r_{2} }} \sim F \left( r_{1} , r_{2} \right) $$
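This fact can also be checked by simulation; the sketch below (with degrees of freedom $r_{1} = 2$, $r_{2} = 10$ chosen arbitrarily for illustration) compares the empirical quantiles of $(U / r_{1}) / (V / r_{2})$ with the theoretical quantiles of $F(r_{1}, r_{2})$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
r1, r2 = 2, 10                            # arbitrary degrees of freedom
U = rng.chisquare(r1, size=100_000)       # independent chi-squared samples
V = rng.chisquare(r2, size=100_000)
ratio = (U / r1) / (V / r2)

# Compare empirical quantiles with the theoretical F(r1, r2) quantiles
qs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(ratio, qs))
print(stats.f.ppf(qs, r1, r2))            # the two rows should be close
```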
$$ \begin{align*} F =& {\frac{ \text{MST} }{ \text{MSE} }} \\ =& {\frac{ \text{SST} / (k - 1) }{ \text{SSE} / (n - k) }} \\ =& {\frac{ \left( \text{SST} / \sigma^{2} \right) / (k - 1) }{ \left( \text{SSE} / \sigma^{2 }\right) / (n - k) }} \\ =& {\frac{ Q_{2} / (k-1)}{ Q_{1} / (n-k) }} \\ \sim& F \left( k - 1 , n - k \right) \end{align*} $$ Hence, it is confirmed that, under the assumption that the null hypothesis is true, the test statistic $F$ follows the F-distribution $F \left( k - 1 , n - k \right)$.
■
Further Reading
Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p455.