
Reason for Dividing the Sample Variance by n-1

Why divide by n-1?

If we denote $X_{i} \sim \left( \mu , \sigma^{2} \right)$, the sample variance $S^{2}$ can be represented as follows.
$$ S^{2} := \frac{1}{n-1} \sum_{i=1}^{n} \left( X_{i} - \overline{X} \right)^{2} $$
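
For what it's worth, this is also the quantity most statistical libraries compute when asked for a sample variance. Here is a minimal sketch in Python (the data values are made up for illustration; `numpy` is assumed):

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 4.9])  # hypothetical sample values

n = len(x)
xbar = x.mean()
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # divide by n - 1, not n

# numpy's ddof ("delta degrees of freedom") makes the divisor n - ddof,
# so ddof=1 reproduces the sample variance S^2 defined above
print(s2, np.var(x, ddof=1))
```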

As is well known, unlike the sample mean, the sample variance sums the squared deviations and then divides not by the sample size $n$ but by $n-1$. I won't insist that everyone must find this odd, but if there is any universal sentiment about formulas, it is perfectly normal to feel uneasy about adding up $n$ items and dividing by $n-1$. The explanations below are divided roughly into three levels: high school students, statistics freshmen, and statistics graduates.

It should be noted that, except for the graduate-level explanation, the others mix in a degree of mathematical fiction, so it is fine if you cannot fully accept them or find something strange. Also, depending on the field, the sample variance is sometimes calculated by dividing by $n$ instead of $n-1$, so rather than debating which is right or wrong, it is important to understand the precise reason.

Explanation at the high school level

When $n=1$, the denominator becomes $0$, so the variance is left undefined. Since $n$ and $n-1$ differ only by $1$, there is no significant difference between dividing by $n$ or $n-1$ once the sample size grows. The problem is when the sample size is small, especially when $n=1$, that is, when there is only one sample. Variance describes the spread of data, and it is common sense that the spread of a single data point cannot be defined at all.

If only a single data point $x_{1}$ is given, then the sample mean is also $\overline{x} = x_{1}$, so dividing by $n$ would give a sample variance of $s^{2} = 0$, which might be read as 'not spread at all'. However, the same value appears even when the sample size increases: if all data points are exactly the same, then $\overline{x} = x_{1} = \cdots = x_{n}$ and the sample variance is again $s^{2} = 0$. Being numerically not spread and being unable to be spread at all are fundamentally different things. Calculating the sample variance by dividing by $n-1$ compresses this long explanation into a short formula.
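
The contrast between the two cases can be seen directly. The sketch below assumes `numpy`; dividing by $n$ returns $0$ for a single observation, while dividing by $n-1$ leaves the result undefined (numpy returns `nan` with a degrees-of-freedom warning, as far as I recall):

```python
import numpy as np

single = np.array([3.0])       # a lone, hypothetical data point

print(np.var(single, ddof=0))  # 0.0  -> "numerically not spread"
print(np.var(single, ddof=1))  # nan  -> "spread cannot be defined"
```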

Since variance was never defined as 'the average of the squared deviations' in the first place, there is no particular reason to divide by $n$ unless we invoke the term expected value; and technically, if all that mattered were producing some summary of the squared deviations, dividing by an absurd number like $n+1$ or $n-7$ would not matter either. However, if you wish to argue that 'it does not make sense to define variance for a single sample', that number has to be precisely $n-1$.

Explanation at the freshman level in statistics

$n$ is merely the sample size; the degrees of freedom is precisely $n-1$. It may look as if the data uses $n$ items because we add up the $X_{i}$ $n$ times, but once $\overline{x} = \sum_{i=1}^{n} x_{i} / n$ is given, $x_{1} , \cdots , x_{n-1}$ already determine the remaining value $x_{n}$ by back-calculation. In other words, despite the appearance of the formula, the number of data points that actually vary freely in the variance calculation is not $n$ but one fewer, $n-1$. This is called the degrees of freedom, and in this context it can be understood as 'the sample size actually used'. Hence, the sample variance should be divided by the degrees of freedom $n-1$, not by the sample size $n$, which carries no such meaning here.
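
To see that the last observation is indeed fixed once the mean and the other $n-1$ values are known, here is a small illustration (hypothetical numbers, `numpy` assumed):

```python
import numpy as np

x = np.array([2.0, 7.0, 4.0, 9.0, 3.0])  # hypothetical data
n, xbar = len(x), x.mean()

# Given xbar and x_1, ..., x_{n-1}, the last value is forced:
#   x_n = n * xbar - (x_1 + ... + x_{n-1})
x_last = n * xbar - x[:-1].sum()
print(x_last == x[-1])                   # True: only n - 1 deviations vary freely
```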

Explanation at the graduate level

Simply put, it is to ensure that the expected value of the sample variance equals the population variance. More precisely, the sample variance is an unbiased estimator only when divided by $n-1$. An unbiased estimator is an estimator without bias, and indeed, unfolding the formula shows that dividing by $n-1$ exactly cancels the issues that arise from dealing with $\overline{X}$. The summarized derivation goes as follows.
$$ \begin{align*} E S^{2} &= (n-1)^{-1} E \sum_{i=1}^{n} \left( X_{i} - \overline{X} \right)^{2} \\ &= (n-1)^{-1} \left[ \sum_{i=1}^{n} \left( \sigma^{2} + \mu^{2} \right) - n \left( \mu^{2} + \frac{\sigma^{2}}{n} \right) \right] \\ &= (n-1)^{-1} (n-1) \sigma^{2} \\ &= \sigma^{2} \end{align*} $$
Since this derivation is heavily abbreviated, it is recommended to refer to the detailed proof for a proper understanding. According to this mathematical discussion, dividing the sample variance by $n-1$ is not an improvement over dividing by $n$ but a necessity with its own justification. If the calculated sample variance were not expected to equal the actual population variance, there would be no reason to call it the sample variance at all.
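
The identity $E S^{2} = \sigma^{2}$ can also be checked numerically. The following Monte Carlo sketch, using a normal population with hypothetical parameters $\mu = 3$, $\sigma^{2} = 4$, is not part of the proof, only an illustration of the bias that dividing by $n$ would introduce:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 3.0, 4.0, 5, 200_000

# Draw many small samples and average the two competing estimators
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
print(samples.var(axis=1, ddof=1).mean())  # ~ 4.0 (divide by n-1: unbiased)
print(samples.var(axis=1, ddof=0).mean())  # ~ 3.2 (divide by n: biased toward (n-1)/n * sigma^2)
```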

If one understands the graduate-level explanation and then revisits the high school and freshman explanations, it becomes clear that they are considerably flawed. The high school argument, that we divide by $n-1$ so that the variance is undefined for a single sample, requires agreement on the premise that it should not be defined for one sample; and even if freshmen accept that the degrees of freedom is $n-1$, there is no clear explanation of why one should divide by the degrees of freedom. These are merely ways of explaining without leaning heavily on formulas, so that the listener can grasp the idea intuitively.

Conversely, if one has studied statistics and claims to understand it to this extent, then at the very least the mathematical literacy to handle such formulas is required. If you are satisfied that your level is sufficient, you may focus on your primary field; but if something feels lacking, invest the time and effort to understand the formulas naturally.