
Definition of Pooled Variance 📂Mathematical Statistics


Buildup

Let’s say we have samples drawn independently from a population $X \sim \left( \mu , \sigma^{2} \right)$, but these samples actually come from $m$ different populations $\left( \mu_{1} , \sigma_{1}^{2} \right), \cdots , \left( \mu_{m} , \sigma_{m}^{2} \right)$, with $n_{1} , \cdots , n_{m}$ samples drawn from each:
$$ \begin{align*} \left\{ X_{1} \right\}_{n_{1}} \overset{\text{iid}}{\sim} & \left( \mu_{1} , \sigma_{1}^{2} \right) \\ \vdots & \\ \left\{ X_{m} \right\}_{n_{m}} \overset{\text{iid}}{\sim} & \left( \mu_{m} , \sigma_{m}^{2} \right) \end{align*} $$
Naturally, the total number of samples is $n = \sum_{i=1}^{m} n_{i}$. Since each group is iid, their order is irrelevant: using the index $k$ over the entire sample we write $\left\{ X_{k} \right\}_{k=1}^{n} \sim \left( \mu , \sigma^{2} \right)$, and using the index $i$ we write $X_{i} \overset{\text{iid}}{\sim} \left( \mu_{i} , \sigma_{i}^{2} \right)$ for a random variable from the $i$th group. We will call the mean $\mu$ and variance $\sigma^{2}$ of the original large population the population pooled mean and population pooled variance, respectively. The ultimate goal of this post is not simply to state the definition of pooled variance, but to derive its unbiased estimator, the sample pooled variance $S_{p}^{2}$. Surprisingly, a proper demonstration of this proof is hard to find online, so remember that you can find it here even if you don’t need it right away.
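As a concrete illustration, the sampling setup above can be sketched in a few lines of Python; the group parameters below are hypothetical, chosen only to show the structure.

```python
# A minimal sketch of the setup: m groups, each with its own
# (mu_i, sigma_i), and n_i draws from each, flattened into one
# combined sample of size n = sum n_i. All values are hypothetical.
import random

random.seed(1)
params = [(0.0, 1.0, 3), (5.0, 2.0, 4)]  # (mu_i, sigma_i, n_i) per group

samples = [[random.gauss(mu, s) for _ in range(n)] for mu, s, n in params]
pooled = [x for group in samples for x in group]  # the {X_k}, k = 1, ..., n

assert len(pooled) == sum(n for _, _, n in params)  # n = n_1 + ... + n_m
```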

Population Pooled Mean

Let’s first check that the population pooled mean and variance can be expressed in terms of the individual population means and variances. By the definition of the mean, the population pooled mean satisfies
$$ \begin{align*} n \mu =& n_{1} \mu + \cdots + n_{m} \mu \\ =& \sum_{k=1}^{n} E X_{k} \\ =& \sum_{i=1}^{m} n_{i} E X_{i} \\ =& n_{1} \mu_{1} + \cdots + n_{m} \mu_{m} \end{align*} $$
and therefore
$$ \mu = {{ n_{1} \mu_{1} + \cdots + n_{m} \mu_{m} } \over { n_{1} + \cdots + n_{m} }} $$
that is, the population pooled mean is the weighted average of the group means, weighted by the number of samples.
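Numerically, the weighted-average formula agrees with simply averaging over the whole combined population. A quick check, with made-up group sizes and means:

```python
# Check: the pooled mean is the sample-size-weighted average of the
# group means. The group sizes and means here are hypothetical.
n = [3, 5, 2]            # n_1, ..., n_m
mu = [10.0, 20.0, 30.0]  # mu_1, ..., mu_m

pooled_mean = sum(ni * mi for ni, mi in zip(n, mu)) / sum(n)

# Equivalently: the plain mean of n values in which each group
# contributes its mean n_i times.
flat = [mi for ni, mi in zip(n, mu) for _ in range(ni)]
assert pooled_mean == sum(flat) / len(flat) == 19.0
```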

Population Pooled Variance

Similarly, though the $X_{k}$ are not iid, they are sampled independently, so by the additivity of variance for independent variables,
$$ \begin{align*} n \sigma^{2} =& n_{1} \sigma^{2} + \cdots + n_{m} \sigma^{2} \\ =& \sum_{k=1}^{n} E \left( X_{k} - \mu \right)^{2} \\ =& E \sum_{k=1}^{n} \left( X_{k} - \mu \right)^{2} \\ =& E \sum_{i=1}^{m} n_{i} X_{i}^{2} - 2 E \sum_{i=1}^{m} n_{i} X_{i} \mu_{i} + \sum_{i=1}^{m} n_{i} \mu_{i}^{2} \\ =& E \sum_{i=1}^{m} n_{i} \left( X_{i} - \mu_{i} \right)^{2} \\ =& n_{1} \sigma_{1}^{2} + \cdots + n_{m} \sigma_{m}^{2} \end{align*} $$
which gives
$$ \sigma^{2} = {{ n_{1} \sigma_{1}^{2} + \cdots + n_{m} \sigma_{m}^{2} } \over { n_{1} + \cdots + n_{m} }} $$
so the population pooled variance is the weighted average of the group variances, weighted by the number of samples. Now, under homoscedasticity, i.e., assuming only the population means differ and $\sigma = \sigma_{1} = \cdots = \sigma_{m}$, let’s look into the sample pooled variance $S_{p}^{2}$, an unbiased estimator of $\sigma^{2}$.
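The same weighted-average computation applies to the variances; under homoscedasticity it simply collapses to the common value. A tiny check with hypothetical sizes and variances:

```python
# The population pooled variance as a sample-size-weighted average of
# the group variances; sizes and variances here are hypothetical.
n = [4, 6, 10]
sigma2 = [2.0, 2.0, 2.0]  # homoscedastic case: all groups share sigma^2

pooled_var = sum(ni * s2 for ni, s2 in zip(n, sigma2)) / sum(n)
assert pooled_var == 2.0  # the weighted average returns the common value
```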

Formula

Sample Pooled Variance

$$ \begin{align*} \left\{ X_{1} \right\}_{n_{1}} \overset{\text{iid}}{\sim} & \left( \mu_{1} , \sigma^{2} \right) \\ \vdots & \\ \left\{ X_{m} \right\}_{n_{m}} \overset{\text{iid}}{\sim} & \left( \mu_{m} , \sigma^{2} \right) \end{align*} $$
The sample pooled variance $S_{p}^{2}$ is the weighted average of the sample variances $S_{1}^{2} , \cdots , S_{m}^{2}$, weighted by their degrees of freedom:
$$ S_{p}^{2} := {{ \left( n_{1} - 1 \right) S_{1}^{2} + \cdots + \left( n_{m} - 1 \right) S_{m}^{2} } \over { \left( n_{1} - 1 \right) + \cdots + \left( n_{m} - 1 \right) }} = {{ \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} } \over { \sum_{i=1}^{m} \left( n_{i} - 1 \right) }} $$
The sample pooled variance $S_{p}^{2}$ so defined is an unbiased estimator of the population pooled variance $\sigma^{2}$:
$$ E S_{p}^{2} = \sigma^{2} $$
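The definition translates directly into code. Here is a small standard-library sketch; the sample data are made up for illustration:

```python
# Sample pooled variance: weighted average of the group sample
# variances, weighted by degrees of freedom n_i - 1. Data are made up.
def sample_variance(xs):
    # unbiased sample variance with the (n - 1) denominator
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def pooled_variance(groups):
    num = sum((len(g) - 1) * sample_variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return num / den

groups = [[1.0, 2.0, 3.0], [10.0, 12.0, 14.0]]
print(pooled_variance(groups))  # → 2.5
```

Group one has sample variance $1$ with two degrees of freedom, group two has sample variance $4$ with two degrees of freedom, so the pooled value is $(2 \cdot 1 + 2 \cdot 4)/4 = 2.5$.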

Derivation

From the definitions of the sample variances
$$ \begin{align*} S_{1}^{2} =& {{ 1 } \over { n_{1} - 1 }} \sum_{j=1}^{n_{1}} \left( X_{1j} - \overline{X}_{1} \right)^{2} \\ & \vdots \\ S_{m}^{2} =& {{ 1 } \over { n_{m} - 1 }} \sum_{j=1}^{n_{m}} \left( X_{mj} - \overline{X}_{m} \right)^{2} \end{align*} $$
we obtain the following. Here $j = 1 , \cdots , n_{i}$ is merely an index repeating $n_{i}$ times; for convenience we will drop the second subscript and write $X_{i}$ instead of $X_{ij}$, but we must remember that $X_{ij_{1}}$ and $X_{ij_{2}}$ are independent.
$$ \begin{align*} & \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \left( X_{i} - \overline{X}_{i} \right)^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \left[ \left( X_{i} - \mu_{i} \right) + \left( \mu_{i} - \overline{X}_{i} \right) \right]^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \left[ \left( X_{i} - \mu_{i} \right)^{2} - 2 \left( X_{i} - \mu_{i} \right) \left( \overline{X}_{i} - \mu_{i} \right) + \left( \overline{X}_{i} - \mu_{i} \right)^{2} \right] \end{align*} $$

Now, before taking the expected value on both sides, let’s examine the expected value of each term.

Properties of Covariance: For random variables $X$ and $Y$ with means $\mu_{X}$ and $\mu_{Y}$ respectively, the covariance between $X$ and $Y$ is defined as $\operatorname{Cov} (X ,Y) := E \left[ ( X - \mu_{X} ) ( Y - \mu_{Y} ) \right]$. Covariance has the following properties:

  • [1]: Var(X)=Cov(X,X)\operatorname{Var} (X) = \operatorname{Cov} (X,X)
  • [4]: Cov(X+Y,Z)=Cov(X,Z)+Cov(Y,Z)\operatorname{Cov} (X + Y , Z ) = \operatorname{Cov}(X,Z) + \operatorname{Cov}(Y,Z)
  • [5]: Cov(aX+b,cY+d)=acCov(X,Y)\operatorname{Cov} (aX + b , cY + d ) = ac \operatorname{Cov}(X,Y)

The first term is trivially $E \left( X_{i} - \mu_{i} \right)^{2} = \sigma_{i}^{2}$. For the second term, since $\overline{X}_{i} = \sum_{j} X_{ij} / n_{i}$ and the draws $X_{ij_{1}} \perp X_{ij_{2}}$ are iid, or at least independently sampled, we have for any $j_{0} \in \left\{ 1, \cdots, n_{i} \right\}$
$$ \begin{align*} & E \left( X_{i} - \mu_{i} \right) \left( \overline{X}_{i} - \mu_{i} \right) \\ =& \operatorname{Cov} \left( X_{i} , \overline{X}_{i} \right) \\ =& \operatorname{Cov} \left( X_{ij_{0}} , {{ X_{ij_{0}} } \over { n_{i} }} \right) + \sum_{j \ne j_{0}} \operatorname{Cov} \left( X_{ij_{0}} , {{ X_{ij} } \over { n_{i} }} \right) \\ =& {{ 1 } \over { n_{i} }} \operatorname{Cov} \left( X_{i} , X_{i} \right) + 0 \\ =& {{ 1 } \over { n_{i} }} \operatorname{Var} X_{i} \\ =& {{ 1 } \over { n_{i} }} \sigma_{i}^{2} \end{align*} $$
and by the standard error formula for sample means, the third term is
$$ E \left( \overline{X}_{i} - \mu_{i} \right)^{2} = {{ 1 } \over { n_{i} }} \sigma_{i}^{2} $$
Therefore, assuming homoscedasticity $\sigma = \sigma_{1} = \cdots = \sigma_{m}$,
$$ \begin{align*} & E \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} \\ =& \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} \sigma_{i}^{2} - 2 \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} {{ 1 } \over { n_{i} }} \sigma_{i}^{2} + \sum_{i=1}^{m} \sum_{j = 1}^{n_{i}} {{ 1 } \over { n_{i} }} \sigma_{i}^{2} \\ =& n \sigma^{2} - \sum_{i=1}^{m} \sigma_{i}^{2} \\ =& n \sigma^{2} - m \sigma^{2} \\ =& (n-m) \sigma^{2} \\ =& \sum_{i=1}^{m} \left( n_{i} - 1 \right) \sigma^{2} \end{align*} $$
Dividing both sides by $\sum_{i=1}^{m} \left( n_{i} - 1 \right) = n - m$ leads to the final result:
$$ E S_{p}^{2} = E {{ \sum_{i=1}^{m} \left( n_{i} - 1 \right) S_{i}^{2} } \over { \sum_{i=1}^{m} \left( n_{i} - 1 \right) }} = \sigma^{2} $$
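As a sanity check on the unbiasedness claim, a Monte Carlo experiment should produce an average $S_{p}^{2}$ very close to $\sigma^{2}$. This sketch uses the standard library only; the group means, sizes, and common variance are chosen arbitrarily.

```python
# Monte Carlo check of E[S_p^2] = sigma^2 under homoscedasticity.
# Group means/sizes and the common variance are arbitrary choices.
import random
import statistics

random.seed(0)
sigma = 2.0                                # common sd, so sigma^2 = 4
groups = [(0.0, 4), (5.0, 7), (-3.0, 10)]  # (mu_i, n_i) pairs

reps = 20000
total = 0.0
for _ in range(reps):
    num = den = 0.0
    for mu, n in groups:
        x = [random.gauss(mu, sigma) for _ in range(n)]
        num += (n - 1) * statistics.variance(x)  # (n_i - 1) * S_i^2
        den += n - 1                             # degrees of freedom
    total += num / den

print(total / reps)  # averages out close to sigma^2 = 4.0
```

Note that `statistics.variance` uses the $n - 1$ denominator, matching the $S_{i}^{2}$ in the derivation; the group means deliberately differ, since unbiasedness only requires the variances to be equal.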