Definition of Pooled Variance
Buildup
Suppose we have samples drawn independently from a population with distribution $X \sim \left( \mu , \sigma^{2} \right)$, where the population is actually composed of $m$ different subpopulations $\left( \mu_{1} , \sigma_{1}^{2} \right), \cdots , \left( \mu_{m} , \sigma_{m}^{2} \right)$, with $n_{1} , \cdots , n_{m}$ samples drawn from each, giving a collection of random samples. $$ \begin{align*} \left\{ X_{1} \right\}_{n_{1}} \overset{\text{iid}}{\sim} & \left( \mu_{1} , \sigma_{1}^{2} \right) \\ \vdots & \\ \left\{ X_{m} \right\}_{n_{m}} \overset{\text{iid}}{\sim} & \left( \mu_{m} , \sigma_{m}^{2} \right) \end{align*} $$ The total number of samples is $n = \sum_{i=1}^{m} n_{i}$. Because the samples within each group are iid, their order is irrelevant. With the index $k$ we write the entire sample as $\left\{ X_{k} \right\}_{k=1}^{n} \sim \left( \mu , \sigma^{2} \right)$, and with the index $i$ we write a random variable from the $i$th group as $X_{i} \overset{\text{iid}}{\sim} \left( \mu_{i} , \sigma_{i}^{2} \right)$. We call the mean $\mu$ and variance $\sigma^{2}$ of the original large population the Population Pooled Mean and Population Pooled Variance, respectively. The goal of this post is not merely to state the definition of pooled variance but to show that the sample pooled variance $S_{p}^{2}$ is its unbiased estimator. A complete version of this proof is surprisingly hard to find online, so keep in mind that it is available here even if you don't need it right away.
Population Pooled Mean
Let’s first check whether the population pooled mean and variance can be expressed in terms of the groups' population means and variances. By the definition of the mean, the population pooled mean satisfies $$ \begin{align*} n \mu =& n_{1} \mu + \cdots + n_{m} \mu \\ =& \sum_{k=1}^{n} E X_{k} \\ =& \sum_{i=1}^{m} n_{i} E X_{i} \\ =& n_{1} \mu_{1} + \cdots + n_{m} \mu_{m} \end{align*} $$ and therefore $$ \mu = {{ n_{1} \mu_{1} + \cdots + n_{m} \mu_{m} } \over { n_{1} + \cdots + n_{m} }} $$ meaning that the pooled mean is the weighted average of the groups' population means, weighted by the number of samples.
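The weighted average above can be sketched in a few lines of Python. The function name `pooled_mean` and the group sizes and means below are hypothetical, chosen only to illustrate the formula.

```python
def pooled_mean(sizes, means):
    """Weighted average of group means, weighted by group sample sizes."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    n = sum(sizes)  # total number of samples, n = n_1 + ... + n_m
    return sum(n_i * mu_i for n_i, mu_i in zip(sizes, means)) / n


# Hypothetical example: three groups with n_i = 10, 30, 60 and mu_i = 1, 2, 3.
sizes = [10, 30, 60]
means = [1.0, 2.0, 3.0]
mu = pooled_mean(sizes, means)
print(mu)  # (10*1 + 30*2 + 60*3) / 100 = 2.5
```

The largest group pulls the pooled mean toward its own mean, which is exactly what weighting by sample size implies.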
Population Pooled Variance
Similarly, although the $X_{k}$ are not iid, they are sampled independently, so by the linearity of variance under independence, with each deviation taken about the observation's own mean $E X_{k}$, $$ \begin{align*} n \sigma^{2} =& n_{1} \sigma^{2} + \cdots + n_{m} \sigma^{2} \\ =& \sum_{k=1}^{n} E \left( X_{k} - E X_{k} \right)^{2} \\ =& E \sum_{k=1}^{n} \left( X_{k} - E X_{k} \right)^{2} \\ =& E \sum_{i=1}^{m} n_{i} X_{i}^{2} - 2 E \sum_{i=1}^{m} n_{i} X_{i} \mu_{i} + \sum_{i=1}^{m} n_{i} \mu_{i}^{2} \\ =& E \sum_{i=1}^{m} n_{i} \left( X_{i} - \mu_{i} \right)^{2} \\ =& n_{1} \sigma_{1}^{2} + \cdots + n_{m} \sigma_{m}^{2} \end{align*} $$ which results in $$ \sigma^{2} = {{ n_{1} \sigma_{1}^{2} + \cdots + n_{m} \sigma_{m}^{2} } \over { n_{1} + \cdots + n_{m} }} $$ so the pooled variance is the weighted average of the groups' population variances, weighted by the number of samples. Now, in the case where the samples are homoscedastic, i.e., assuming only the population means differ and $\sigma = \sigma_{1} = \cdots = \sigma_{m}$, let’s look at the sample pooled variance $S_{p}^{2}$, an unbiased estimator of $\sigma^{2}$.
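Returning briefly to the population pooled variance formula above, here is a quick illustration with hypothetical numbers (not taken from any data set): take $m = 2$ groups with $n_{1} = 10$, $n_{2} = 30$, $\sigma_{1}^{2} = 4$, and $\sigma_{2}^{2} = 8$. Substituting into the weighted average gives $$ \sigma^{2} = {{ 10 \cdot 4 + 30 \cdot 8 } \over { 10 + 30 }} = {{ 280 } \over { 40 }} = 7 $$ which sits closer to $\sigma_{2}^{2}$ because the second group contributes three times as many samples.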
Formula
Sample Pooled Variance
$$ \begin{align*} \left\{ X_{1} \right\}_{n_{1}} \overset{\text{iid}}{\sim} & \left( \mu_{1} , \sigma^{2} \right) \\ \vdots & \\ \left\{ X_{m} \right\}_{n_{m}} \overset{\text{iid}}{\sim} & \left( \mu_{m} , \sigma^{2} \right) \end{align*} $$ Sample Pooled Variance $S_{p}^{2}$ is represented as a weighted average of each sample variance $S_{1}^{2} , \cdots , S_{m}^{2}$, weighted by their degrees of freedom. $$ S_{p}^{2} := {{ \left( n_{1} - 1 \right) S_{1}^{2} + \cdots + \left( n_{m} - 1 \right) S_{m}^{2} } \over { \left( n_{1} - 1 \right) + \cdots + \left( n_{m} - 1 \right) }} = {{ \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} } \over { \sum_{i=1}^{m} \left( n_{i} - 1 \right) }} $$ Thus defined, sample pooled variance $S_{p}^{2}$ is an unbiased estimator of the population pooled variance $\sigma^{2}$. $$ E S_{p}^{2} = \sigma^{2} $$
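As a minimal sketch of the definition above, the following Python snippet computes $S_{p}^{2}$ from raw groups using the standard library's `statistics.variance`, which uses the $n_{i} - 1$ denominator. The function name `pooled_variance` and the data are hypothetical.

```python
from statistics import variance


def pooled_variance(groups):
    """Degrees-of-freedom-weighted average of the per-group sample variances."""
    if any(len(g) < 2 for g in groups):
        raise ValueError("each group needs at least two observations")
    numerator = sum((len(g) - 1) * variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)  # equals n - m
    return numerator / denominator


# Hypothetical data: two groups with clearly different means.
groups = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0, 16.0]]
s_p2 = pooled_variance(groups)
print(s_p2)  # (2 * 1 + 3 * 20/3) / 5 = 4.4, up to floating-point rounding
```

Note that the large gap between the two group means does not inflate the result, because each $S_{i}^{2}$ is measured around its own group's sample mean.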
Derivation
From the definitions of the sample variances $$ \begin{align*} S_{1}^{2} =& {{ 1 } \over { n_{1} - 1 }} \sum_{j=1}^{n_{1}} \left( X_{1} - \overline{X}_{1} \right)^{2} \\ & \vdots \\ S_{m}^{2} =& {{ 1 } \over { n_{m} - 1 }} \sum_{j=1}^{n_{m}} \left( X_{m} - \overline{X}_{m} \right)^{2} \end{align*} $$ we obtain the following. Here, $j = 1 , \cdots , n_{i}$ merely counts the $n_{i}$ observations in group $i$; for convenience we do not write out the double index as in $X_{ij_{1}}$ and $X_{ij_{2}}$, but we must remember that distinct observations are independent. $$ \begin{align*} & \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \left( X_{i} - \overline{X}_{i} \right)^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \left[ \left( X_{i} - \mu_{i} \right) + \left( \mu_{i} - \overline{X}_{i} \right) \right]^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \left[ \left( X_{i} - \mu_{i} \right)^{2} - 2 \left( X_{i} - \mu_{i} \right) \left( \overline{X}_{i} - \mu_{i} \right) + \left( \overline{X}_{i} - \mu_{i} \right)^{2} \right] \end{align*} $$
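The expansion is a purely algebraic identity, so it holds for any numbers, not only in expectation. A small numerical check for a single group, with hypothetical observations `x` and a hypothetical population mean `mu`, might look like this:

```python
from statistics import mean, variance

x = [2.0, 4.0, 9.0]  # hypothetical observations from one group
mu = 3.0             # hypothetical population mean of that group
n = len(x)
xbar = mean(x)       # sample mean of the group

# Left-hand side: (n_i - 1) * S_i^2
lhs = (n - 1) * variance(x)

# Right-hand side: the three-term expansion, summed over the observations
rhs = sum(
    (v - mu) ** 2 - 2 * (v - mu) * (xbar - mu) + (xbar - mu) ** 2
    for v in x
)

print(lhs, rhs)  # both sides equal 26 for these numbers
```

Changing `mu` to any other value leaves both sides equal, since $\mu_{i}$ is added and subtracted inside the square.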
Now, before taking the expected value on both sides, let’s examine the expected value for each term.
Properties of Covariance: For random variables $X$ and $Y$ with means $\mu_{X}$ and $\mu_{Y}$ respectively, $\operatorname{Cov} (X ,Y) := E \left[ ( X - \mu_{X} ) ( Y - \mu_{Y} ) \right]$ is defined as the Covariance between $X$ and $Y$. Covariance has the following properties:
- [1]: $\operatorname{Var} (X) = \operatorname{Cov} (X,X)$
- [4]: $\operatorname{Cov} (X + Y , Z ) = \operatorname{Cov}(X,Z) + \operatorname{Cov}(Y,Z)$
- [5]: $\operatorname{Cov} (aX + b , cY + d ) = ac \operatorname{Cov}(X,Y)$
The first term is trivially $E \left( X_{i} - \mu_{i} \right)^{2} = \sigma_{i}^{2}$. For the second term, $\overline{X}_{i} = \sum_{j} X_{ij} / n_{i}$, and the observations within a group are independent, $X_{ij_{1}} \perp X_{ij_{2}}$. Hence, fixing any $j_{0} \in \left\{ 1, \cdots, n_{i} \right\}$ and writing $X_{i}$ for $X_{ij_{0}}$, $$ \begin{align*} & E \left( X_{i} - \mu_{i} \right) \left( \overline{X}_{i} - \mu_{i} \right) \\ =& \operatorname{Cov} \left( X_{i} , \overline{X}_{i} \right) \\ =& \operatorname{Cov} \left( X_{i} , {{ {X}_{i} } \over { n_{i} }} \right) + \sum_{j \ne j_{0}} \operatorname{Cov} \left( X_{ij_{0}} , {{ {X}_{ij} } \over { n_{i} }} \right) \\ =& {{ 1 } \over { n_{i} }} \operatorname{Cov} \left( X_{i} , X_{i} \right) + 0 \\ =& {{ 1 } \over { n_{i} }} \operatorname{Var} X_{i} \\ =& {{ 1 } \over { n_{i} }} \sigma_{i}^{2} \end{align*} $$ and, by the formula for the variance of a sample mean, $$ E \left( \overline{X}_{i} - \mu_{i} \right)^{2} = {{ 1 } \over { n_{i} }} \sigma_{i}^{2} $$ Thus, assuming homoscedasticity $\sigma = \sigma_{1} = \cdots = \sigma_{m}$, $$ \begin{align*} & E \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \sigma_{i}^{2} - 2 \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} {{ 1 } \over { n_{i} }} \sigma_{i}^{2} + \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} {{ 1 } \over { n_{i} }} \sigma_{i}^{2} \\ =& \sum_{i=1}^{m} n_{i} \sigma_{i}^{2} - \sum_{i=1}^{m} \sigma_{i}^{2} \\ =& n \sigma^{2} - m \sigma^{2} \\ =& (n-m) \sigma^{2} \\ =& \sum_{i=1}^{m} \left( n_{i} - 1 \right) \sigma^{2} \end{align*} $$ which gives the final result: $$ E S_{p}^{2} = E {{ \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} } \over { \sum_{i=1}^{m} \left( n_{i} - 1 \right) }} = \sigma^{2} $$
■
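The unbiasedness result can also be checked empirically. The following is a rough Monte Carlo sketch using only the Python standard library. It assumes normally distributed groups, which the derivation itself does not require, and the group means, sizes, $\sigma$, and replication count are hypothetical choices.

```python
import random


def sample_variance(xs):
    """Sample variance with the n - 1 denominator."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def pooled_variance(groups):
    """Degrees-of-freedom-weighted average of the per-group sample variances."""
    numerator = sum((len(g) - 1) * sample_variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return numerator / denominator


def simulate_mean_pooled_variance(means, sizes, sigma, reps, seed=0):
    """Average S_p^2 over many simulated data sets sharing a common sigma."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(reps):
        # Assumption: each group is normal with its own mean and common sigma.
        groups = [
            [rng.gauss(mu, sigma) for _ in range(n)]
            for mu, n in zip(means, sizes)
        ]
        total += pooled_variance(groups)
    return total / reps


estimate = simulate_mean_pooled_variance(
    means=[0.0, 5.0, -3.0], sizes=[3, 5, 8], sigma=2.0, reps=20000
)
print(estimate)  # should land close to sigma**2 = 4.0
```

Even with unequal group sizes and widely separated means, the average of $S_{p}^{2}$ across replications should settle near $\sigma^{2}$, in line with $E S_{p}^{2} = \sigma^{2}$.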