F-test in Analysis of Variance

Hypothesis Testing 1

Assume that there are $k$ treatments in an experimental design, and that treatment $j$ contains $n_{j}$ samples, totaling $n = n_{1} + \cdots + n_{k}$ samples. Assume that each sample in the $j = 1 , \cdots , k$th treatment is independently and randomly drawn from a normal distribution $N \left( \mu_{j} , \sigma_{j}^{2} \right)$, and that the population variances are equal, $\sigma^{2} = \sigma_{1}^{2} = \cdots = \sigma_{k}^{2}$. In an analysis of variance designed to compare the population means, hypothesis testing proceeds as follows:

  • $H_{0}$: $\mu_{1} = \cdots = \mu_{k}$
  • $H_{1}$: At least one $\mu_{j}$ is different from the others.

Test Statistic

Assume we have the ANOVA table as follows:

| Source | df | SS | MS | F |
| --- | --- | --- | --- | --- |
| Treatments | $k-1$ | SST | MST | MST/MSE |
| Error | $n-k$ | SSE | MSE | |
| Total | $n-1$ | TSS | | |

The test statistic is as follows: $$ F = {\frac{ \text{MST} }{ \text{MSE} }} = {\frac{ \text{SST} / (k - 1) }{ \text{SSE} / (n - k) }} $$ Under the null hypothesis, this test statistic follows an F-distribution $F \left( k - 1 , n - k \right)$ with degrees of freedom $(k-1)$ and $(n-k)$.
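As a sanity check, the formula above can be evaluated numerically. The following is a minimal sketch with made-up data (the group values are hypothetical); `scipy.stats.f_oneway` is used only to cross-check the hand-computed statistic.

```python
import numpy as np
from scipy import stats

# Three hypothetical treatment groups with unequal sizes
groups = [np.array([4.2, 5.1, 4.8, 5.5]),
          np.array([6.0, 6.3, 5.8]),
          np.array([4.9, 5.2, 5.0, 5.4, 5.1])]
k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# SST = sum_j n_j (xbar_j - xbar)^2,  SSE = sum_j sum_i (x_ij - xbar_j)^2
sst = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (sst / (k - 1)) / (sse / (n - k))

F_scipy, p = stats.f_oneway(*groups)
print(F, F_scipy)  # the two F values agree
```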

Explanation

Whether it’s a one-way ANOVA or a two-way ANOVA, the mathematical derivations are very similar, differing only in the presence of blocks. For convenience, in this post, I’ll focus only on the theoretical background of one-way ANOVA under a completely randomized design.

Let’s explore how the quantities in the ANOVA table lead to the test statistic, and thus to the hypothesis test. Since the derivation requires substantial background in linear algebra and mathematical statistics, undergraduates may skip it, but graduate students are encouraged to work through it.

Derivation 2

Part 1. $\sum Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3}$

Let the treatment mean be $\bar{x}_{j} := \sum_{i} x_{ij} / n_{j}$, the overall mean be $\bar{x} := \sum_{ij} x_{ij} / n$, and let $s_{j}^{2}$ denote the sample variance of treatment $j$. $$ \begin{align*} \text{SST} =& \sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right)^{2} \\ \text{SSE} =& \left( n_{1} - 1 \right) s_{1}^{2} + \cdots + \left( n_{k} - 1 \right) s_{k}^{2} \\ \text{MST} =& {\frac{ \text{SST} }{ k - 1 }} \\ \text{MSE} =& {\frac{ \text{SSE} }{ n - k }} \\ F =& {\frac{ \text{MST} }{ \text{MSE} }} = {\frac{ \text{SST}/ (k - 1) }{ \text{SSE} / (n - k) }} \end{align*} $$
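These identities are easy to verify numerically. The sketch below uses randomly generated groups (the sizes and seed are arbitrary) and checks both the sample-variance form of SSE and the partition $\text{TSS} = \text{SST} + \text{SSE}$.

```python
import numpy as np

rng = np.random.default_rng(0)
groups = [rng.normal(5.0, 1.0, size=m) for m in (4, 6, 5)]  # hypothetical sizes
x = np.concatenate(groups)
xbar = x.mean()

# SSE via sample variances (n_j - 1) s_j^2, and directly via deviations
sse_var = sum((len(g) - 1) * g.var(ddof=1) for g in groups)
sse_dev = sum(((g - g.mean()) ** 2).sum() for g in groups)

sst = sum(len(g) * (g.mean() - xbar) ** 2 for g in groups)
tss = ((x - xbar) ** 2).sum()

print(np.isclose(sse_var, sse_dev), np.isclose(tss, sst + sse_dev))
```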

The ANOVA table for one-way ANOVA appears as above. Assuming the null hypothesis is true, for some $\mu$ we can set $\mu = \mu_{1} = \cdots = \mu_{k}$ and define the z-score $Z_{ij}$ as follows: $$ Z_{ij} := {\frac{ x_{ij} - \mu }{ \sigma }} \sim N \left( 0, 1 \right) $$ The sum of squares of the $Z_{ij}$ can be expanded as follows: $$ \begin{align*} & \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} \\ =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \mu }{ \sigma }} \right)^{2} \\ =& {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right) + \left( \bar{x}_{j} - \bar{x} \right) + \left( \bar{x} - \mu \right) \right]^{2} \\ =& {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right)^{2} + \left( \bar{x}_{j} - \bar{x} \right)^{2} + \left( \bar{x} - \mu \right)^{2} \right] \\ &+ {\frac{ 2 }{ \sigma^{2} }} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left[ \left( x_{ij} - \bar{x}_{j} \right) \left( \bar{x}_{j} - \bar{x} \right) + \left( \bar{x}_{j} - \bar{x} \right) \left( \bar{x} - \mu \right) + \left( \bar{x} - \mu \right) \left( x_{ij} - \bar{x}_{j} \right) \right] \end{align*} $$ For each $j$, $$ \begin{align*} & \sum_{i=1}^{n_{j}} \left( x_{ij} - \bar{x}_{j} \right) \\ =& \sum_{i=1}^{n_{j}} x_{ij} - n_{j} {\frac{ 1 }{ n_{j} }} \sum_{i=1}^{n_{j}} x_{ij} \\ =& 0 \end{align*} $$ so the two cross terms containing $\left( x_{ij} - \bar{x}_{j} \right)$ vanish, and since $\sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right) = 0$ the remaining cross term vanishes as well. Thus the sum of squares of the $Z_{ij}$ can be expressed as: $$ \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} = \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \bar{x}_{j} }{ \sigma }} \right)^{2} + \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x}_{j} - \bar{x} }{ \sigma }} \right)^{2} + \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x} - \mu }{ \sigma }} \right)^{2} $$ Now, label the three sums on the right-hand side as $Q_{1}, Q_{2}, Q_{3}$ in order. $$ \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} = Q_{1} + Q_{2} + Q_{3} $$
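This decomposition can likewise be checked on simulated data. A minimal sketch assuming a common $\mu$ and $\sigma$ under $H_{0}$ (the numeric values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5                 # hypothetical common mean and sd under H0
groups = [rng.normal(mu, sigma, size=m) for m in (5, 7, 6)]
x = np.concatenate(groups)
xbar = x.mean()

lhs = (((x - mu) / sigma) ** 2).sum()                            # sum of Z_ij^2
q1 = sum((((g - g.mean()) / sigma) ** 2).sum() for g in groups)  # Q_1
q2 = sum(len(g) * ((g.mean() - xbar) / sigma) ** 2 for g in groups)  # Q_2
q3 = len(x) * ((xbar - mu) / sigma) ** 2                         # Q_3
print(np.isclose(lhs, q1 + q2 + q3))
```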


Part 2. Cochran’s Theorem

Sum of squared deviations expressed as a quadratic form of a random vector: For a random vector $\mathbf{X} = \left( X_{1} , \cdots , X_{n} \right)$, the identity matrix $I_{n} \in \mathbb{R}^{n \times n}$, and the all-ones matrix $J_{n} \in \mathbb{R}^{n \times n}$ whose elements are all $1$, the following holds: $$ \mathbf{X}^{T} \left( I_{n} - {\frac{ 1 }{ n }} J_{n} \right) \mathbf{X} = ( n - 1 ) S^{2} $$ Here, $S^{2}$ denotes the sample variance.

Let’s define three symmetric matrices $A_{1} , A_{2} , A_{3}$ using the identity matrix $I_{n}$ of size $n \times n$, the all-ones matrices $J_{n}$ and $J_{n_{j}}$, and a block diagonal matrix $\diag \left( \cdot \right)$ as follows: $$ \begin{align*} A_{1} :=& I_{n} - \diag \left( {\frac{ 1 }{ n_{1} }} J_{n_{1}} , \cdots , {\frac{ 1 }{ n_{k} }} J_{n_{k}} \right) \\ A_{2} :=& \diag \left( {\frac{ 1 }{ n_{1} }} J_{n_{1}} , \cdots , {\frac{ 1 }{ n_{k} }} J_{n_{k}} \right) - {\frac{ 1 }{ n }} J_{n} \\ A_{3} :=& {\frac{ 1 }{ n }} J_{n} \end{align*} $$ Note that each block of the block diagonal matrix sets the stage for the $\left( n_{j} - 1 \right) s_{j}^{2}$ of each treatment $j$. From their definitions, the sum of the three matrices is $A_{1} + A_{2} + A_{3} = I_{n}$, and starting from $\rank J_{n} = 1$ it is not hard to see that their ranks are: $$ \begin{align*} \rank A_{1} =& n - k \\ \rank A_{2} =& k - 1 \\ \rank A_{3} =& 1 \end{align*} $$
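Here is a small numeric sketch of these matrices (the group sizes are arbitrary), checking that they sum to $I_{n}$ and have the stated ranks; `scipy.linalg.block_diag` assembles the block diagonal part.

```python
import numpy as np
from scipy.linalg import block_diag

ns = [4, 3, 5]                       # hypothetical group sizes n_1, ..., n_k
n, k = sum(ns), len(ns)
P = block_diag(*[np.ones((m, m)) / m for m in ns])  # diag(J_{n_j} / n_j)

A1 = np.eye(n) - P
A2 = P - np.ones((n, n)) / n
A3 = np.ones((n, n)) / n

assert np.allclose(A1 + A2 + A3, np.eye(n))         # A1 + A2 + A3 = I_n
ranks = [np.linalg.matrix_rank(A) for A in (A1, A2, A3)]
print(ranks, [n - k, k - 1, 1])                     # the two lists match
```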

Let’s define the vector $\mathbf{Z} \in \mathbb{R}^{n \times 1}$ by stacking, for each treatment $j$, the $n_{j}$-dimensional vectors of z-scores $z_{j} := \left( Z_{1j} , \cdots , Z_{n_{j} j} \right) \in \mathbb{R}^{n_{j} \times 1}$ as follows: $$ \mathbf{Z} := \begin{bmatrix} z_{1} \\ \vdots \\ z_{k} \end{bmatrix} = \begin{bmatrix} \left( Z_{11} , \cdots , Z_{n_{1} 1} \right) \\ \vdots \\ \left( Z_{1k} , \cdots , Z_{n_{k} k} \right) \end{bmatrix} = \begin{bmatrix} Z_{11} \\ \vdots \\ Z_{n_{k} k} \end{bmatrix} $$ In this vector notation, the sum of squares of the $Z_{ij}$ can be represented as follows: $$ \begin{align*} \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} Z_{ij}^{2} =& Q_{1} + Q_{2} + Q_{3} \\ =& \mathbf{Z}^{T} A_{1} \mathbf{Z} + \mathbf{Z}^{T} A_{2} \mathbf{Z} + \mathbf{Z}^{T} A_{3} \mathbf{Z} \end{align*} $$

Cochran’s Theorem: Let the sample $\mathbf{X} = \left( X_{1} , \cdots , X_{n} \right)$ consist of iid normal variables $X_{1} , \cdots , X_{n} \overset{\text{iid}}{\sim} N \left( 0, \sigma^{2} \right)$. Given symmetric matrices $A_{1} , \cdots , A_{k} \in \mathbb{R}^{n \times n}$ with ranks $r_{j}$, if the random variables $Q_{1} , \cdots , Q_{k}$ are quadratic forms $Q_{j} := \mathbf{X}^{T} A_{j} \mathbf{X}$ and the sum of squares of the sample satisfies $\sum_{i=1}^{n} X_{i}^{2} = \sum_{j=1}^{k} Q_{j}$, then the following holds: $$ \forall j , {\frac{ Q_{j} }{ \sigma^{2} }} \sim \chi^{2} \left( r_{j} \right) \land \forall j_{1} \ne j_{2} , Q_{j_{1}} \perp Q_{j_{2}} \iff \sum_{j=1}^{k} r_{j} = n $$ In other words, the statement that the $Q_{j}$ are independent and follow chi-squared distributions $\chi^{2} \left( r_{j} \right)$ is equivalent to the sum of the ranks $r_{j}$ equaling the sample size $n$.

Each component of $\mathbf{Z}$ is independently observed from the standard normal distribution $N \left( 0 , 1^{2} \right)$, and $\sum_{l=1}^{3} \rank A_{l} = n$, so by Cochran’s theorem $Q_{1}$ and $Q_{2}$ independently follow chi-squared distributions: $$ \begin{align*} Q_{1} = {\frac{ Q_{1} }{ 1^{2} }} \sim& \chi^{2} \left( \rank A_{1} \right) = \chi^{2} \left( n - k \right) \\ Q_{2} = {\frac{ Q_{2} }{ 1^{2} }} \sim& \chi^{2} \left( \rank A_{2} \right) = \chi^{2} \left( k - 1 \right) \end{align*} $$ Meanwhile, $Q_{1}$ and $Q_{2}$ can be expressed in terms of $\text{SSE}$ and $\text{SST}$ as follows: $$ \begin{align*} Q_{1} =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ x_{ij} - \bar{x}_{j} }{ \sigma }} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \left[ \left( n_{1} - 1 \right) s_{1}^{2} + \cdots + \left( n_{k} - 1 \right) s_{k}^{2} \right] = {\frac{ 1 }{ \sigma^{2} }} \text{SSE} \\ Q_{2} =& \sum_{j=1}^{k} \sum_{i=1}^{n_{j}} \left( {\frac{ \bar{x}_{j} - \bar{x} }{ \sigma }} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \sum_{j=1}^{k} n_{j} \left( \bar{x}_{j} - \bar{x} \right)^{2} = {\frac{ 1 }{ \sigma^{2} }} \text{SST} \end{align*} $$
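Putting the pieces together, one can verify numerically that the quadratic forms equal $\text{SSE}/\sigma^{2}$ and $\text{SST}/\sigma^{2}$. The sketch below uses arbitrary $\mu$, $\sigma$, and group sizes.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0                 # hypothetical common mean and sd under H0
ns = [4, 6, 5]
n, k = sum(ns), len(ns)
groups = [rng.normal(mu, sigma, size=m) for m in ns]
x = np.concatenate(groups)
Z = (x - mu) / sigma                 # stacked z-scores

P = block_diag(*[np.ones((m, m)) / m for m in ns])
A1 = np.eye(n) - P                   # quadratic form giving SSE / sigma^2
A2 = P - np.ones((n, n)) / n         # quadratic form giving SST / sigma^2

xbar = x.mean()
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
sst = sum(m * (g.mean() - xbar) ** 2 for m, g in zip(ns, groups))

print(np.isclose(Z @ A1 @ Z, sse / sigma**2),
      np.isclose(Z @ A2 @ Z, sst / sigma**2))
```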


Part 3. Deriving F-distribution

Deriving the F-distribution from two independent chi-squared distributions: Suppose two random variables $U, V$ are independent with $U \sim \chi^{2} ( r_{1})$ and $V \sim \chi^{2} ( r_{2})$; then $$ {{ U / r_{1} } \over { V / r_{2} }} \sim F \left( r_{1} , r_{2} \right) $$

$$ \begin{align*} F =& {\frac{ \text{MST} }{ \text{MSE} }} \\ =& {\frac{ \text{SST} / (k - 1) }{ \text{SSE} / (n - k) }} \\ =& {\frac{ \left( \text{SST} / \sigma^{2} \right) / (k - 1) }{ \left( \text{SSE} / \sigma^{2} \right) / (n - k) }} \\ =& {\frac{ Q_{2} / (k-1)}{ Q_{1} / (n-k) }} \\ \sim& F \left( k - 1 , n - k \right) \end{align*} $$ Hence, under the assumption that the null hypothesis is true, the test statistic $F$ indeed follows the F-distribution $F \left( k - 1 , n - k \right)$.
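This null distribution $F(k-1, n-k)$ is exactly what standard software uses for the one-way ANOVA p-value. A brief cross-check with hypothetical data:

```python
import numpy as np
from scipy import stats

groups = [np.array([4.2, 5.1, 4.8, 5.5]),
          np.array([6.0, 6.3, 5.8]),
          np.array([4.9, 5.2, 5.0, 5.4, 5.1])]
k, n = len(groups), sum(len(g) for g in groups)

F, p = stats.f_oneway(*groups)
p_manual = stats.f.sf(F, k - 1, n - k)  # right-tail probability of F(k-1, n-k)
print(np.isclose(p, p_manual))
```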

Further Reading


  1. Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p455. ↩︎

  2. https://statproofbook.github.io/P/anova1-f.html ↩︎