Estimation of the Variance of Residuals and Standard Errors of Regression Coefficients in Multiple Regression Analysis
Theorem
$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$ When there are $p$ independent variables and $n$ observations, the linear multiple regression model can be written with a design matrix as shown above; for brevity, let’s express it as $Y = X \beta + \varepsilon$. Since the estimate of the regression coefficients is $\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} Y$, the vector $\hat{Y}$ of fitted values is $$ \hat{Y} = X \hat{\beta} = X \left( X^{T} X \right)^{-1} X^{T} Y $$ For convenience, let’s set $P := X \left( X^{T} X \right)^{-1} X^{T}$. Meanwhile, it is assumed that the residuals satisfy linearity, i.e., the population mean of $\varepsilon_{1} , \cdots , \varepsilon_{n}$ is $0$.
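To make the notation concrete, here is a minimal NumPy sketch of the objects just defined, using randomly generated data purely for illustration: it computes $\hat{\beta}$, the fitted values $\hat{Y}$, and the projection matrix $P$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 3                                                 # n observations, p independent variables

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5, 0.3])                  # arbitrary true coefficients
eps = rng.normal(scale=1.5, size=n)                          # errors with sigma = 1.5
Y = X @ beta_true + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y        # beta_hat = (X^T X)^{-1} X^T Y
P = X @ XtX_inv @ X.T               # P = X (X^T X)^{-1} X^T
Y_hat = P @ Y                       # fitted values, identical to X @ beta_hat

print(beta_hat)
print(np.allclose(Y_hat, X @ beta_hat))  # True
```

In practice one would compute $\hat{\beta}$ with `np.linalg.lstsq` or a dedicated regression library rather than an explicit inverse; the explicit form is used here only to mirror the formulas.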
Expected Value of the Sum of Squared Residuals
- [1]: If the residuals also have homoscedasticity, i.e., if for some constant $\sigma > 0$, $\varepsilon_{1} , \cdots , \varepsilon_{n} \sim \left( 0, \sigma^{2} \right)$ holds, then the expected value of $SSE$ is as follows. $$ E \left( SSE \right) = E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] = n \sigma^{2} - \sum_{i,j} E \left( \varepsilon_{i} \varepsilon_{j} \right) P_{ij} $$
Unbiased Estimator for the Variance of the Residuals
- [2]: If the residuals also have independence, i.e., if $\varepsilon_{1} , \cdots , \varepsilon_{n} \overset{\text{iid}}{\sim} \left( 0, \sigma^{2} \right)$ holds, then $\widehat{\sigma^{2}}$ defined below is an unbiased estimator of the residual variance $\sigma^{2}$. $$ E \widehat{\sigma^{2}} = E \left[ {{ 1 } \over { n-p-1 }} \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] = \sigma^{2} $$
Standard Error of Regression Coefficients
- [3]: If the residuals also have normality, i.e., if $\varepsilon_{1} , \cdots , \varepsilon_{n} \overset{\text{iid}}{\sim} N \left( 0, \sigma^{2} \right)$ holds, then the standard error of the regression coefficients is as follows. $$ \text{s.e.} \left( \hat{\beta}_{k} \right) = \hat{\sigma} \sqrt{ \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } $$
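Putting the three results together, the standard error of each coefficient can be computed directly from the formula. The sketch below is a minimal illustration with simulated data (all parameters are arbitrary); the resulting values can be compared against the standard errors reported by regression software, e.g., the `bse` attribute of a fitted statsmodels OLS model, if that package is available.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=2.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
residuals = Y - X @ beta_hat

sigma2_hat = residuals @ residuals / (n - p - 1)   # unbiased estimator of sigma^2 from [2]
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))        # s.e.(beta_hat_k) = sigma_hat * sqrt([(X^T X)^{-1}]_kk)
print(se)
```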
Explanation
Most statistics majors, when they first learn regression analysis in school, are swamped with projects and other coursework, so they often gloss over this kind of mathematical-statistical development. Willingness and motivation aside, the content is simply too difficult to digest at the undergraduate sophomore or junior level, and forcing oneself to push through it can also be inefficient. If the proofs don’t make sense from the get-go, it’s okay to set them aside without getting frustrated.
However, if one continues their studies at the master’s level or higher and reviews the undergraduate courses, it is strongly recommended to work through the content organized here. In model diagnostics for multiple regression analysis, the most important assumption is linearity, followed by homoscedasticity, then independence, and then normality. To derive the t-test and F-test in regression analysis, these assumptions must be added in exactly that order. This may not be something one can grasp intuitively or through experience alone, but fortunately it can be understood through theoretical study even without that kind of genius.
Proof[^1]
Strategy: This is not the easiest proof. Setting mathematical statistics aside for a moment, one must at least be well-versed in matrix algebra. The matrix $P := X \left( X^{T} X \right)^{-1} X^{T}$ defined for brevity in the statement of the theorem is idempotent, i.e., a projection operator: $$ \begin{align*} P^{2} =& X \left( X^{T} X \right)^{-1} X^{T} \cdot X \left( X^{T} X \right)^{-1} X^{T} \\ =& X \left( X^{T} X \right)^{-1} \left( X^{T} X \right) \left( X^{T} X \right)^{-1} X^{T} \\ =& X \left( X^{T} X \right)^{-1} X^{T} \\ =& P \end{align*} $$ Moreover, since $\left( X^{T} X \right)^{-1}$ is symmetric, $P^{T} = P$, so $P^{2} = P = P^{T}$, and the complementary projector $\left( I - P \right)$ is also a projection operator, hence $\left( I - P \right) ^{2} = \left( I - P \right)$ holds. This fact will be used as a lemma. If this seems too difficult, it is recommended to study more and return to this proof in a few years.
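The lemma is easy to check numerically. Below is a minimal sketch (assuming NumPy and a randomly generated full-rank design matrix) verifying that $P$ is symmetric and idempotent, and that $I - P$ is idempotent as well.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # full column rank with probability 1

P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose(P @ P, P))                  # idempotent: P^2 = P
print(np.allclose(P, P.T))                    # symmetric: P^T = P
print(np.allclose((I - P) @ (I - P), I - P))  # (I - P)^2 = I - P
```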
[1][^2]
Since $P X = X \left( X^{T} X \right)^{-1} X^{T} X = X$, we have $\left( I_{n} - P \right) Y = \left( I_{n} - P \right) \left( X \beta + \varepsilon \right) = \left( I_{n} - P \right) \varepsilon$, where $I_{n}$ is the $n \times n$ identity matrix. Then, for the Kronecker delta $\delta_{ij} = \begin{cases} 1 & , \text{if } i = j \\ 0 & , \text{if } i \ne j \end{cases}$, the following holds: $$ \begin{align*} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] =& E \left[ \left( Y - P Y \right)^{T} \left( Y - P Y \right) \right] \\ =& E \left[ \left[ \left( I_{n} - P \right) \varepsilon \right] ^{T} \left[ \left( I_{n} - P \right) \varepsilon \right] \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right)^{T} \left( I_{n} - P \right) \varepsilon \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right) \left( I_{n} - P \right) \varepsilon \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right)^{2} \varepsilon \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right) \varepsilon \right] \\ =& E \left[ \sum_{i,j} \varepsilon_{i} \varepsilon_{j} \left( \delta_{ij} - P_{ij} \right) \right] \\ =& \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \delta_{ij} \right] - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \\ =& \sum_{i} E \left[ \varepsilon_{i}^{2} \right] - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \\ =& n \sigma^{2} - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \end{align*} $$ The last equality uses linearity and homoscedasticity: $E \left[ \varepsilon_{i}^{2} \right] = \operatorname{Var} \left( \varepsilon_{i} \right) = \sigma^{2}$.
■
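The key step above, that $\left( I_{n} - P \right) Y = \left( I_{n} - P \right) \varepsilon$ because $P X = X$, can be illustrated numerically. The sketch below (simulated data, assuming NumPy) shows that the residual vector computed from $Y$ coincides with the one computed from the unobservable $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -1.0, 0.5, 2.0])
eps = rng.normal(scale=1.0, size=n)
Y = X @ beta + eps

P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose(P @ X, X))                    # True: PX = X, so (I - P)X beta = 0
print(np.allclose((I - P) @ Y, (I - P) @ eps))  # True: residual vector depends only on eps
```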
[2]
That the residuals are independent (with population mean $0$) means that for $i \ne j$, $\varepsilon_{i}$ and $\varepsilon_{j}$ are uncorrelated, so $E \left[ \varepsilon_{i} \varepsilon_{j} \right] = E \left[ \varepsilon_{i} \right] E \left[ \varepsilon_{j} \right] = 0$, while for $i = j$, by the linearity and homoscedasticity of the residuals, $E \left[ \varepsilon_{i}^{2} \right] = \sigma^{2}$. Thus we obtain: $$ \begin{align*} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] =& n \sigma^{2} - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \\ =& n \sigma^{2} - \sum_{i} \sigma^{2} P_{ii} \end{align*} $$
Cyclic property of trace: $$ \text{Tr}(ABC) = \text{Tr}(BCA) = \text{Tr}(CAB) $$
$\sum_{i} P_{ii}$ is the trace of $P$, $\text{tr} P$, thus $$ \begin{align*} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] =& n \sigma^{2} - \sigma^{2} \sum_{i} P_{ii} \\ =& \sigma^{2} \left( n - \text{tr} P \right) \\ =& \sigma^{2} \left( n - \text{tr} X \left( X^{T} X \right)^{-1} X^{T} \right) \\ =& \sigma^{2} \left( n - \text{tr} X^{T} X \left( X^{T} X \right)^{-1} \right) \\ =& \sigma^{2} \left( n - \text{tr} I_{1+p} \right) \\ =& \sigma^{2} \left( n - (1+p) \right) \end{align*} $$ is obtained. Dividing both sides by $(n-p-1)$, $$ {{ 1 } \over { n-p-1 }} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] = \sigma^{2} $$ hence, the unbiased estimator for $\sigma^{2}$, $\widehat{\sigma^{2}} = \sum \left( y_{i} - \hat{y}_{i} \right)^{2} / (n-p-1)$, is obtained.
■
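A quick Monte Carlo check of [2]: repeatedly simulating data from a fixed design and averaging $SSE / (n - p - 1)$ should recover $\sigma^{2}$. The sketch below (assuming NumPy; the sample sizes and parameters are arbitrary) illustrates this, and also shows that dividing by $n$ instead would be biased downward.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 25, 3, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # fixed design
beta = np.array([2.0, 1.0, 0.0, -1.0])
P = X @ np.linalg.inv(X.T @ X) @ X.T

sse_values = []
for _ in range(20000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    resid = Y - P @ Y                 # residual vector (I - P)Y
    sse_values.append(resid @ resid)  # SSE for this replication

sse = np.array(sse_values)
print(sse.mean() / (n - p - 1))  # approximately sigma^2 = 2.25 (unbiased)
print(sse.mean() / n)            # noticeably smaller than 2.25 (biased)
```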
[3]
Multivariate normality of the regression coefficient vector: $$ \hat{\beta} \sim N_{1+p} \left( \beta , \sigma^{2} \left( X^{T} X \right)^{-1} \right) $$
If the residuals are iid normal, then the marginal distribution of the $k$th component $\hat{\beta}_{k}$ of $\hat{\beta} = \left( \hat{\beta}_{0} , \cdots , \hat{\beta}_{p} \right)$ is also the following univariate normal distribution: $$ \hat{\beta}_{k} \sim N \left( \beta_{k} , \sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk} \right) $$
General definition of standard error: For some estimator $T$, the estimate of the standard deviation of $T$ is called the Standard Error. $$ \text{s.e.} \left( T \right) := \sqrt{ \widehat{ \operatorname{Var} \left( T \right) } } $$
$\operatorname{Var} \hat{\beta}_{k} = \sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk}$ hence we obtain the following. $$ \begin{align*} \text{s.e.} \left( \hat{\beta}_{k} \right) =& \sqrt{ \widehat{ \operatorname{Var} \left( \hat{\beta}_{k} \right) } } \\ =& \sqrt{ \widehat{\sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } } \\ =& \sqrt{ {{ 1 } \over { n-p-1 }} \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } \\ =& \hat{\sigma} \sqrt{ \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } \end{align*} $$
■
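As a final sanity check of the variance used in [3], one can simulate the sampling distribution of a single coefficient and compare its empirical variance with $\sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk}$. The sketch below (assuming NumPy, with arbitrary simulated parameters) does this for $k = 1$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, k = 40, 2, 1.0, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # fixed design
beta = np.array([0.0, 1.0, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

beta_k_hats = []
for _ in range(20000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = XtX_inv @ X.T @ Y
    beta_k_hats.append(beta_hat[k])

print(np.var(beta_k_hats))        # empirical variance of beta_hat_k
print(sigma**2 * XtX_inv[k, k])   # theoretical variance: close to the line above
```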
[^1]: Hadi. (2006). Regression Analysis by Example (4th Edition): p82~84.
[^2]: https://math.stackexchange.com/questions/2277739/why-is-the-denominator-n-p-1-in-estimation-of-variance