Large Sample Hypothesis Testing for the Difference Between Two Population Means

Hypothesis Testing [1]

Let’s say two independent populations follow distributions $\left( \mu_{1} , \sigma_{1}^{2} \right)$ and $\left( \mu_{2} , \sigma_{2}^{2} \right)$, respectively. In the large-sample case, meaning the sample sizes satisfy $n_{1} , n_{2} > 30$, the hypothesis test about the difference between the two population means against a candidate value $D_{0}$ is as follows:

  • $H_{0}$: $\mu_{1} - \mu_{2} = D_{0}$. That is, the difference between the population means is $D_{0}$.
  • $H_{1}$: $\mu_{1} - \mu_{2} \ne D_{0}$. That is, the difference between the population means is not $D_{0}$.

Test Statistic

The test statistic is calculated slightly differently depending on whether the population standard deviations $\sigma_{1}, \sigma_{2}$ are known or not.

  • When $\sigma_{1} , \sigma_{2}$ are known: use the population standard deviations $\sigma_{1} , \sigma_{2}$ as they are. $$ Z = {{ \left( \overline{X}_{1} - \overline{X}_{2} \right) - D_{0} } \over { \sqrt{ {{ \sigma_{1}^{2} } \over { n_{1} }} + {{ \sigma_{2}^{2} } \over { n_{2} }} } }} $$
  • When $\sigma_{1} , \sigma_{2}$ are unknown: use the sample standard deviations $s_{1} , s_{2}$ instead (a computational sketch follows this list). $$ Z = {{ \left( \overline{X}_{1} - \overline{X}_{2} \right) - D_{0} } \over { \sqrt{ {{ s_{1}^{2} } \over { n_{1} }} + {{ s_{2}^{2} } \over { n_{2} }} } }} $$
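As a quick illustration, here is a minimal Python sketch that computes this test statistic from summary statistics. The sample sizes, means, and standard deviations below are made-up numbers for demonstration only, and the unknown-$\sigma$ case simply passes $s_{1}, s_{2}$ in place of $\sigma_{1}, \sigma_{2}$.

```python
from math import sqrt

def z_statistic(xbar1, xbar2, sd1, sd2, n1, n2, d0=0.0):
    """Large-sample Z statistic for H0: mu1 - mu2 = d0.

    sd1, sd2 are the population standard deviations when known,
    or the sample standard deviations s1, s2 otherwise.
    """
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of Xbar1 - Xbar2
    return (xbar1 - xbar2 - d0) / se

# made-up summary statistics; n1, n2 > 30 so the large-sample test applies
z = z_statistic(xbar1=12.5, xbar2=11.8, sd1=2.1, sd2=1.9, n1=50, n2=60)
print(round(z, 3))  # compare |z| with the standard normal critical value
```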

Explanation

In practice, $D_{0} = 0$ is used most of the time, because in many cases what one really wants to know is ‘whether the two populations have the same mean’ rather than ‘exactly how large the difference is’. The most complex-looking part of the test statistic is the denominator $\sqrt{ {{ \sigma_{1}^{2} } \over { n_{1} }} + {{ \sigma_{2}^{2} } \over { n_{2} }} }$, which becomes fun once you learn its derivation after studying Mathematical Statistics. Unfortunately, freshmen just have to memorize it.

Derivation

Central Limit Theorem: If $\left\{ X_{k} \right\}_{k=1}^{n}$ are iid random variables following a distribution $\left( \mu, \sigma^{2} \right)$, then as $n \to \infty$ $$ \sqrt{n} {{ \overline{X} - \mu } \over {\sigma}} \overset{D}{\to} N (0,1) $$

Since we assume a large sample from both populations, regardless of the population distributions, $\overline{X}_{1}$ and $\overline{X}_{2}$ are approximately normally distributed by the Central Limit Theorem. $$ \begin{align*} \overline{X}_{1} =& {{ 1 } \over { n_{1} }} \sum_{k=1}^{n_{1}} X_{1k} \sim N \left( \mu_{1} , {{ \sigma_{1}^{2} } \over { n_{1} }} \right) \\ \overline{X}_{2} =& {{ 1 } \over { n_{2} }} \sum_{k=1}^{n_{2}} X_{2k} \sim N \left( \mu_{2} , {{ \sigma_{2}^{2} } \over { n_{2} }} \right) \end{align*} $$
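As a hedged sanity check (a sketch only, using an exponential population purely as an example of a skewed, non-normal distribution), a short simulation shows that the sample mean of $n > 30$ observations already has roughly the mean $\mu$ and standard deviation $\sigma / \sqrt{n}$ stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
n1 = 40                                    # large sample: n > 30
# 10,000 replicates of Xbar_1 from a skewed exponential population (mu = sigma = 2)
means = rng.exponential(scale=2.0, size=(10_000, n1)).mean(axis=1)
print(means.mean())                        # close to mu = 2
print(means.std())                         # close to sigma / sqrt(n1) = 2 / sqrt(40), about 0.316
```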

Sum of Normal Distributions: Let’s say the random variables $X_{1} , \cdots , X_{n}$ are mutually independent.

  • If $X_{i} \sim N( \mu_{i}, \sigma_{i}^{2} )$, then for a given vector $(a_{1} , \cdots , a_{n}) \in \mathbb{R}^{n}$ $$ \sum_{i=1}^{n} a_{i} X_{i} \sim N \left( \sum_{i=1}^{n} a_{i} \mu_{i} , \sum_{i=1}^{n} a_{i}^{2} \sigma_{i}^{2} \right) $$

For the vector $\left( a_{1} , a_{2} \right) = (1, -1) \in \mathbb{R}^{2}$ $$ \begin{align*} \overline{X}_{1} - \overline{X}_{2} =& a_{1} \overline{X}_{1} + a_{2} \overline{X}_{2} \\ \sim& N \left( \sum_{i=1}^{2} a_{i} \mu_{i} , \sum_{i=1}^{2} a_{i}^{2} \sigma_{i}^{2} \right) \\ \sim& N \left( \mu_{1} - \mu_{2} , 1^{2} \cdot {{ \sigma_{1}^{2} } \over { n_{1} }} + (-1)^{2} \cdot {{ \sigma_{2}^{2} } \over { n_{2} }} \right) \end{align*} $$ Thus, under the null hypothesis $H_{0} : \mu_{1} - \mu_{2} = D_{0}$, $$ Z = {{ \left( \overline{X}_{1} - \overline{X}_{2} \right) - D_{0} } \over { \sqrt{ {{ \sigma_{1}^{2} } \over { n_{1} }} + {{ \sigma_{2}^{2} } \over { n_{2} }} } }} \sim N \left( 0, 1 \right) $$ that is, $Z$ approximately follows the standard normal distribution $N(0,1)$. Likewise, since $s \approx \sigma$ in the large-sample case, it is acceptable to use $s_{1}, s_{2}$ in place of $\sigma_{1}, \sigma_{2}$ when the population variances are unknown.

Let $Y$ follow the standard normal distribution and, for the significance level $\alpha$, let $z_{\alpha/2}$ satisfy $P \left( Y \ge z_{\alpha/2} \right) = \alpha / 2$. Then rejecting $H_{0}$ at level $\alpha$ is equivalent to $$ \left| Z \right| \ge z_{\alpha/2} $$ which means that $\overline{X}_{1} - \overline{X}_{2}$ is too far from $D_{0}$ for $\mu_{1} - \mu_{2} = D_{0}$ to remain plausible under the null hypothesis.
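Putting the pieces together, the following sketch runs the two-sided test end to end. The data are simulated here under $H_{0}$ with $D_{0} = 0$, and the sample standard deviations stand in for the unknown population ones; the scipy calls for the critical value and p-value are standard, but everything else (sample sizes, means, scales) is made up for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, d0 = 0.05, 0.0

# simulated large samples from two populations with equal means, so H0 is true
x1 = rng.normal(loc=10.0, scale=3.0, size=60)
x2 = rng.normal(loc=10.0, scale=2.0, size=80)

# large-sample Z statistic with s1, s2 in place of sigma1, sigma2
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
z = (x1.mean() - x2.mean() - d0) / se

p_value = 2 * norm.sf(abs(z))              # two-sided p-value
critical = norm.ppf(1 - alpha / 2)         # z_{alpha/2}

print(f"Z = {z:.3f}, p-value = {p_value:.3f}")
print("reject H0" if abs(z) >= critical else "fail to reject H0")
```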


  [1] Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p363.