
Large Sample Hypothesis Testing for the Difference Between Two Population Means

Hypothesis Testing 1

Let’s say two independent populations follow distributions $\left( \mu_{1} , \sigma_{1}^{2} \right)$ and $\left( \mu_{2} , \sigma_{2}^{2} \right)$, respectively. In the case of a large sample, meaning the sample sizes satisfy $n_{1} , n_{2} > 30$, the hypothesis test for the difference between the two population means against a candidate value $D_{0}$ is as follows:

  • $H_{0}$: $\mu_{1} - \mu_{2} = D_{0}$. That is, the difference in population means is $D_{0}$.
  • $H_{1}$: $\mu_{1} - \mu_{2} \ne D_{0}$. That is, the difference in population means is not $D_{0}$.

Test Statistic

The test statistic is calculated slightly differently depending on whether the population standard deviations $\sigma_{1}, \sigma_{2}$ are known or not; a computational sketch follows the list.

  • When $\sigma_{1} , \sigma_{2}$ is known: Use the population standard deviation $\sigma_{1} , \sigma_{2}$ as it is. $$ Z = {{ \left( \overline{X}_{1} - \overline{X}_{2} \right) - D_{0} } \over { \sqrt{ {{ \sigma_{1}^{2} } \over { n_{1} }} + {{ \sigma_{2}^{2} } \over { n_{2} }} } }} $$
  • When $\sigma_{1}, \sigma_{2}$ is unknown: Use the sample standard deviation $s_{1}, s_{2}$ as follows. $$ Z = {{ \left( \overline{X}_{1} - \overline{X}_{2} \right) - D_{0} } \over { \sqrt{ {{ s_{1}^{2} } \over { n_{1} }} + {{ s_{2}^{2} } \over { n_{2} }} } }} $$
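To make the formula concrete, here is a minimal Python sketch (the data and the helper name `two_sample_z` are hypothetical, not from the source) that computes $Z$, using $\sigma_{1}, \sigma_{2}$ when they are supplied and falling back to the sample standard deviations $s_{1}, s_{2}$ otherwise.

```python
import numpy as np

def two_sample_z(x1, x2, d0=0.0, sigma1=None, sigma2=None):
    """Large-sample Z statistic for H0: mu1 - mu2 = d0.

    If sigma1, sigma2 are not given, the sample standard
    deviations s1, s2 are used in their place.
    """
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n1, n2 = len(x1), len(x2)
    s1 = sigma1 if sigma1 is not None else x1.std(ddof=1)
    s2 = sigma2 if sigma2 is not None else x2.std(ddof=1)
    se = np.sqrt(s1**2 / n1 + s2**2 / n2)  # denominator: std. error of X1bar - X2bar
    return (x1.mean() - x2.mean() - d0) / se

# hypothetical data: two independent large samples (n1, n2 > 30)
rng = np.random.default_rng(42)
x1 = rng.normal(10.0, 2.0, size=50)
x2 = rng.normal(9.0, 3.0, size=60)
print(two_sample_z(x1, x2))  # compare |Z| against the critical value
```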

Explanation

In practice, $D_{0} = 0$ is by far the most common choice, because what one usually wants to know is ‘whether the two populations have the same mean’ rather than ‘exactly how large the difference is’. The most complex-looking part of the test statistic is the denominator $\sqrt{ {{ \sigma_{1}^{2} } \over { n_{1} }} + {{ \sigma_{2}^{2} } \over { n_{2} }} }$, which becomes fun once you learn its derivation in Mathematical Statistics. Unfortunately, freshmen have to memorize it.

Derivation

Central Limit Theorem: If $\left\{ X_{k} \right\}_{k=1}^{n}$ are iid random variables following the distribution $\left( \mu, \sigma^2 \right)$, then as $n \to \infty$ $$ \sqrt{n} {{ \overline{X} - \mu } \over {\sigma}} \overset{D}{\to} N (0,1) $$

Since we assume a large sample from both populations, regardless of the population distributions, $\overline{X}_{1}, \overline{X}_{2}$ approximately follow normal distributions according to the Central Limit Theorem. $$ \begin{align*} \overline{X}_{1} =& {{ 1 } \over { n_{1} }} \sum_{k=1}^{n_{1}} X_{1k} \sim N \left( \mu_{1} , {{ \sigma_{1}^{2} } \over { n_{1} }} \right) \\ \overline{X}_{2} =& {{ 1 } \over { n_{2} }} \sum_{k=1}^{n_{2}} X_{2k} \sim N \left( \mu_{2} , {{ \sigma_{2}^{2} } \over { n_{2} }} \right) \end{align*} $$
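As a quick sanity check of this step, the following sketch (all parameters illustrative) draws the two samples from non-normal exponential populations and confirms that the simulated mean and variance of $\overline{X}_{1} - \overline{X}_{2}$ match $\mu_{1} - \mu_{2}$ and $\sigma_{1}^{2}/n_{1} + \sigma_{2}^{2}/n_{2}$.

```python
import numpy as np

# Exponential populations: mean = scale, variance = scale^2.
rng = np.random.default_rng(0)
n1, n2, reps = 40, 50, 100_000
mu1, mu2 = 2.0, 1.5
var1, var2 = mu1**2, mu2**2

# reps independent replications of each sample mean
diffs = (rng.exponential(mu1, (reps, n1)).mean(axis=1)
         - rng.exponential(mu2, (reps, n2)).mean(axis=1))

print(diffs.mean(), mu1 - mu2)          # both ~ 0.5
print(diffs.var(), var1/n1 + var2/n2)   # both ~ 0.145
```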

Sum of Normal Distributions: Let’s say the random variables $X_{1} , \cdots , X_{n}$ are mutually independent.

  • If $X_i \sim N( \mu_{i}, \sigma_{i}^{2} )$, then for a given vector $(a_{1} , \cdots , a_{n}) \in \mathbb{R}^{n}$ $$ \displaystyle \sum_{i=1}^{n} a_{i} X_{i} \sim N \left( \sum_{i=1}^{n} a_{i } \mu_{i} , \sum_{i=1}^{n} a_{i }^2 \sigma_{i}^2 \right) $$
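For instance, if $X_{1} \sim N(1, 4)$ and $X_{2} \sim N(2, 9)$ are independent, then with $(a_{1}, a_{2}) = (1, -1)$ $$ X_{1} - X_{2} \sim N \left( 1 - 2 , 1^{2} \cdot 4 + (-1)^{2} \cdot 9 \right) = N (-1, 13) $$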

Taking the vector $\left( a_{1} , a_{2} \right) = (1, -1) \in \mathbb{R}^{2}$ and applying the theorem to $\overline{X}_{1}, \overline{X}_{2}$, whose variances are $\sigma_{1}^{2}/n_{1}$ and $\sigma_{2}^{2}/n_{2}$, gives $$ \begin{align*} \overline{X}_{1} - \overline{X}_{2} =& a_{1} \overline{X}_{1} + a_{2} \overline{X}_{2} \\ \sim& N \left( \sum_{i=1}^{2} a_{i } \mu_{i} , \sum_{i=1}^{2} a_{i }^2 \operatorname{Var} \overline{X}_{i} \right) \\ \sim& N \left( \mu_{1} - \mu_{2} , 1^{2} \cdot {{ \sigma_{1}^{2} } \over { n_{1} }} + (-1)^{2} \cdot {{ \sigma_{2}^{2} } \over { n_{2} }} \right) \end{align*} $$ thus under the null hypothesis $H_{0} : \mu_{1} - \mu_{2} = D_{0}$ $$ Z = {{ \left( \overline{X}_{1} - \overline{X}_{2} \right) - D_{0} } \over { \sqrt{ {{ \sigma_{1}^{2} } \over { n_{1} }} + {{ \sigma_{2}^{2} } \over { n_{2} }} } }} \sim N \left( 0, 1 \right) $$ approximately follows the standard normal distribution $N (0,1)$. Likewise, since $s \approx \sigma$ in a large sample, it is acceptable to use $s_{1}, s_{2}$ in place of $\sigma_{1}, \sigma_{2}$ when the population variances are unknown.

Now let the random variable $Y$ follow the standard normal distribution, and for the significance level $\alpha$ define $z_{\alpha / 2}$ by $P \left( Y \ge z_{\alpha / 2} \right) = \alpha / 2$. Since the alternative hypothesis is two-sided, rejecting $H_{0}$ at level $\alpha$ is equivalent to the following. $$ \left| Z \right| \ge z_{\alpha / 2} $$ This means that $\overline{X}_{1} - \overline{X}_{2}$ is too far from $D_{0}$ for $\mu_{1} - \mu_{2} = D_{0}$ to remain plausible under the null hypothesis.
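Putting the decision rule into code, a short sketch (the observed statistic $Z = 2.31$ is purely illustrative) of the two-sided test at level $\alpha$ might look like this, using `scipy.stats.norm` for the critical value.

```python
from scipy.stats import norm

alpha = 0.05
z = 2.31                          # hypothetical observed test statistic
z_crit = norm.ppf(1 - alpha / 2)  # z_{alpha/2} ~ 1.96
p_value = 2 * norm.sf(abs(z))     # two-sided p-value

print(abs(z) >= z_crit, p_value)  # True, ~0.021 -> reject H0
```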


  1. Mendenhall. (2012). Introduction to Probability and Statistics (13th Edition): p363. ↩︎