
Reason for Dividing the Sample Variance by n-1

Why divide by n-1?

If we denote $X_{i} \sim \left( \mu , \sigma^{2} \right)$, the sample variance $S^{2}$ can be represented as follows.
$$ S^{2} := \frac{1}{n-1} \sum_{i=1}^{n} \left( X_{i} - \overline{X} \right)^{2} $$
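
For what it's worth, this is also the quantity most statistical libraries compute when asked for a sample variance. Here is a minimal sketch in Python (the data values are made up for illustration; `numpy` is assumed):

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 4.9])  # hypothetical sample values

n = len(x)
xbar = x.mean()
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # divide by n - 1, not n

# numpy's ddof ("delta degrees of freedom") makes the divisor n - ddof,
# so ddof=1 reproduces the sample variance S^2 defined above
print(s2, np.var(x, ddof=1))
```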

As is well known, unlike the sample mean, the sample variance sums the squared deviations and then divides not by the sample size $n$ but by $n-1$. I won't insist that everyone must find this odd, but if there is any universal sentiment about formulas, it is perfectly normal to feel uneasy about adding up $n$ items and dividing by $n-1$. The explanations below are divided roughly into three levels: high school students, statistics freshmen, and statistics graduates.

It should be noted that, except for the graduate-level explanation, the others mix in a degree of mathematical fiction, so it is fine if you cannot fully accept them or find something strange. Also, depending on the field, the sample variance is sometimes calculated by dividing by $n$ instead of $n-1$, so rather than debating which is right or wrong, it is important to understand the precise reason.

Explanation at the high school level

When $n=1$, the denominator becomes $0$, so the variance is left undefined. Since $n$ and $n-1$ differ only by $1$, there is no significant difference between dividing by $n$ or $n-1$ once the sample size grows. The problem is when the sample size is small, especially when $n=1$, that is, when there is only one sample. Variance describes the spread of data, and it is common sense that the spread of a single data point cannot be defined at all.

If only a single data point $x_{1}$ is given, then the sample mean is also $\overline{x} = x_{1}$, so dividing by $n$ would give a sample variance of $s^{2} = 0$, which might be read as 'not spread at all'. However, the same value appears even when the sample size increases: if all data points are exactly the same, then $\overline{x} = x_{1} = \cdots = x_{n}$ and the sample variance is again $s^{2} = 0$. Being numerically not spread and being unable to be spread at all are fundamentally different things. Calculating the sample variance by dividing by $n-1$ compresses this long explanation into a short formula.
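
The contrast between the two cases can be seen directly. The sketch below assumes `numpy`; dividing by $n$ returns $0$ for a single observation, while dividing by $n-1$ leaves the result undefined (numpy returns `nan` with a degrees-of-freedom warning, as far as I recall):

```python
import numpy as np

single = np.array([3.0])       # a lone, hypothetical data point

print(np.var(single, ddof=0))  # 0.0  -> "numerically not spread"
print(np.var(single, ddof=1))  # nan  -> "spread cannot be defined"
```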

Since variance was never defined as 'the average of the squared deviations' in the first place, there is no particular reason to divide by $n$ unless we invoke the term expected value; and technically, if all that mattered were producing some summary of the squared deviations, dividing by an absurd number like $n+1$ or $n-7$ would not matter either. However, if you wish to argue that 'it does not make sense to define variance for a single sample', that number has to be precisely $n-1$.

Explanation at the freshman level in statistics

$n$ is merely the sample size; the degrees of freedom is precisely $n-1$. It may look as if the data uses $n$ items because we add up the $X_{i}$ $n$ times, but once $\overline{x} = \sum_{i=1}^{n} x_{i} / n$ is given, $x_{1} , \cdots , x_{n-1}$ already determine the remaining value $x_{n}$ by back-calculation. In other words, despite the appearance of the formula, the number of data points that actually vary freely in the variance calculation is not $n$ but one fewer, $n-1$. This is called the degrees of freedom, and in this context it can be understood as 'the sample size actually used'. Hence, the sample variance should be divided by the degrees of freedom $n-1$, not by the sample size $n$, which carries no such meaning here.
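
To see that the last observation is indeed fixed once the mean and the other $n-1$ values are known, here is a small illustration (hypothetical numbers, `numpy` assumed):

```python
import numpy as np

x = np.array([2.0, 7.0, 4.0, 9.0, 3.0])  # hypothetical data
n, xbar = len(x), x.mean()

# Given xbar and x_1, ..., x_{n-1}, the last value is forced:
#   x_n = n * xbar - (x_1 + ... + x_{n-1})
x_last = n * xbar - x[:-1].sum()
print(x_last == x[-1])                   # True: only n - 1 deviations vary freely
```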

Explanation at the graduate level

Simply put, it is to ensure that the expected value of the sample variance equals the population variance. More precisely, the sample variance is an unbiased estimator only when divided by $n-1$. An unbiased estimator is an estimator without bias, and indeed, unfolding the formula shows that dividing by $n-1$ exactly cancels the issues that arise from dealing with $\overline{X}$. The summarized derivation goes as follows.
$$ \begin{align*} E S^{2} &= (n-1)^{-1} E \sum_{i=1}^{n} \left( X_{i} - \overline{X} \right)^{2} \\ &= (n-1)^{-1} \left[ \sum_{i=1}^{n} \left( \sigma^{2} + \mu^{2} \right) - n \left( \mu^{2} + \frac{\sigma^{2}}{n} \right) \right] \\ &= (n-1)^{-1} (n-1) \sigma^{2} \\ &= \sigma^{2} \end{align*} $$
Since this derivation is heavily abbreviated, it is recommended to refer to the detailed proof for a proper understanding. According to this mathematical discussion, dividing the sample variance by $n-1$ is not an improvement over dividing by $n$ but a necessity with its own justification. If the calculated sample variance were not expected to equal the actual population variance, there would be no reason to call it the sample variance at all.
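
The identity $E S^{2} = \sigma^{2}$ can also be checked numerically. The following Monte Carlo sketch, using a normal population with hypothetical parameters $\mu = 3$, $\sigma^{2} = 4$, is not part of the proof, only an illustration of the bias that dividing by $n$ would introduce:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 3.0, 4.0, 5, 200_000

# Draw many small samples and average the two competing estimators
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
print(samples.var(axis=1, ddof=1).mean())  # ~ 4.0 (divide by n-1: unbiased)
print(samples.var(axis=1, ddof=0).mean())  # ~ 3.2 (divide by n: biased toward (n-1)/n * sigma^2)
```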

If one understands the graduate-level explanation and then revisits the high school and freshman explanations, it becomes clear that they are considerably flawed. The high school argument, that we divide by $n-1$ so that the variance is undefined for a single sample, requires agreement on the premise that it should not be defined for one sample; and even if freshmen accept that the degrees of freedom is $n-1$, there is no clear explanation of why one should divide by the degrees of freedom. These are merely ways of explaining without leaning heavily on formulas, so that the listener can grasp the idea intuitively.

Conversely, if one has studied statistics and claims to understand it to this extent, then at the very least the mathematical literacy to handle such formulas is required. If you are satisfied that your level is sufficient, you may focus on your primary field; but if something feels lacking, invest the time and effort to understand the formulas naturally.