
Estimation of the Variance of Residuals and Standard Errors of Regression Coefficients in Multiple Regression Analysis

Theorem

$$
\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix}
$$
When there are $p$ independent variables and $n$ data points, the linear multiple regression model can be represented with a design matrix as above; express it simply as $Y = X \beta + \varepsilon$. Since the estimate of the regression coefficients is $\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} Y$, the vector of fitted values is $\hat{Y} = X \hat{\beta} = X \left( X^{T} X \right)^{-1} X^{T} Y$. For convenience, set $P := X \left( X^{T} X \right)^{-1} X^{T}$. Meanwhile, the residuals are assumed to satisfy linearity, i.e., the population mean of $\varepsilon_{1} , \cdots , \varepsilon_{n}$ is $0$.
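
As a quick sanity check of this notation, here is a minimal numpy sketch with entirely made-up data (the sizes $n$ and $p$, the coefficients, and $\sigma$ are arbitrary) that forms a design matrix $X$ with an intercept column, computes $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$, and builds $P = X(X^{T}X)^{-1}X^{T}$ to recover the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: n = 50 observations, p = 2 predictors (values chosen only for illustration)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])
sigma = 1.5
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Least-squares estimate: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Projection (hat) matrix P and fitted values
P = X @ XtX_inv @ X.T
y_fitted = P @ y                        # identical to X @ beta_hat

print(beta_hat)
print(np.allclose(y_fitted, X @ beta_hat))   # True
```

In practice one would solve the least-squares problem with `np.linalg.lstsq` rather than forming the inverse explicitly; the explicit formula is used here only to mirror the notation above.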

Expected Value of the Sum of Squared Residuals

  • [1]: If the residuals also have homoscedasticity, i.e., if for some constant $\sigma > 0$, $\varepsilon_{1} , \cdots , \varepsilon_{n} \sim \left( 0, \sigma^{2} \right)$ holds, then the expected value of $SSE$ is as follows.
$$
E \left( SSE \right) = E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] = n \sigma^{2} - \sum_{i,j} E \left( \varepsilon_{i} \varepsilon_{j} \right) P_{ij}
$$

Unbiased Estimator of the Variance of the Residuals

  • [2]: If the residuals also have independence, i.e., if $\varepsilon_{1} , \cdots , \varepsilon_{n} \overset{\text{iid}}{\sim} \left( 0, \sigma^{2} \right)$ holds, then the unbiased estimator $\widehat{\sigma^{2}}$ of the variance of the residuals satisfies the following.
$$
E \widehat{\sigma^{2}} = E \left[ {{ 1 } \over { n-p-1 }} \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] = \sigma^{2}
$$
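
To see statement [2] in action, the following Monte Carlo sketch (assumed setup: a fixed made-up design, iid errors with known $\sigma$, and 20,000 replications) averages $\widehat{\sigma^{2}} = SSE/(n-p-1)$ over repeated samples; the average should land close to $\sigma^{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup (all numbers made up): fixed design, iid errors with variance sigma^2
n, p, sigma = 40, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([0.5, 1.0, -1.0, 2.0])
P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection matrix

estimates = []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    resid = y - P @ y                         # y_i - y_hat_i
    estimates.append(resid @ resid / (n - p - 1))   # sigma^2_hat = SSE / (n - p - 1)

print(np.mean(estimates), sigma**2)           # the two numbers should be close
```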

Standard Error of Regression Coefficients

  • [3]: If the residuals also have normality, i.e., if $\varepsilon_{1} , \cdots , \varepsilon_{n} \overset{\text{iid}}{\sim} N \left( 0, \sigma^{2} \right)$ holds, then the standard error of the regression coefficients is as follows.
$$
\text{s.e.} \left( \hat{\beta}_{k} \right) = \hat{\sigma} \sqrt{ \left[ \left( X^{T} X \right)^{-1} \right]_{kk} }
$$
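
The formula in [3] is straightforward to compute. The sketch below, again with made-up data, estimates $\hat{\sigma}$ as in [2] and multiplies it by the square roots of the diagonal entries of $(X^{T}X)^{-1}$; this is the same quantity that standard regression software reports in its standard-error column.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data for illustration
n, p, sigma = 60, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p - 1))      # hat sigma from theorem [2]

# s.e.(beta_hat_k) = sigma_hat * sqrt( [(X^T X)^{-1}]_{kk} )
se = sigma_hat * np.sqrt(np.diag(XtX_inv))
print(np.column_stack([beta_hat, se]))                # estimate and standard error per coefficient
```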

Explanation

Most statistics majors, when they first encounter regression analysis in school, are swamped with projects and other coursework, so they often gloss over theoretical developments like these from mathematical statistics. Motivation aside, the material is genuinely difficult at the undergraduate sophomore or junior level, and forcing oneself through it can be inefficient. If the proofs don't make sense at first, it's fine to set them aside without getting frustrated.

However, anyone who continues to the master's level or beyond and revisits the undergraduate material is strongly encouraged to work through the content organized here. In model diagnostics for multiple regression analysis, the most important assumption is linearity, followed by homoscedasticity, independence, and then normality. To derive the t-test and F-test in regression analysis, these assumptions must be added in exactly that order. This ordering may not be obvious from intuition or experience alone, but fortunately it can be understood through theoretical study even without any special genius.

Proof

Strategy: It might not be the easiest. Setting aside mathematical statistics for a moment, one must at least be comfortable with matrix algebra. The matrix $P := X \left( X^{T} X \right)^{-1} X^{T}$ from the statement of the theorem is idempotent and symmetric, i.e., a projection operator:
$$
\begin{align*} P^{2} =& X \left( X^{T} X \right)^{-1} X^{T} \cdot X \left( X^{T} X \right)^{-1} X^{T} \\ =& X \left( X^{T} X \right)^{-1} \left( X^{T} X \right) \left( X^{T} X \right)^{-1} X^{T} \\ =& X \left( X^{T} X \right)^{-1} X^{T} \\ =& P \\ =& P^{T} \end{align*}
$$
hence $P^{2} = P = P^{T}$, and the complementary projector $(I-P)$ is also a projection operator, so $\left( I - P \right)^{2} = \left( I - P \right)$ holds. This fact will be used as a lemma. If this seems too difficult, it is recommended to study more and return to this proof in a few years.
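
The lemma is easy to confirm numerically. A short check with an arbitrary full-rank design matrix (dimensions chosen at random for illustration) verifies $P^{2} = P = P^{T}$ and $(I-P)^{2} = I-P$ up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Any full-rank design matrix will do; the dimensions are arbitrary
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose(P @ P, P))                    # P^2 = P
print(np.allclose(P, P.T))                      # P = P^T
print(np.allclose((I - P) @ (I - P), I - P))    # (I - P)^2 = I - P
```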

[1]

For the Kronecker delta $\delta_{ij} = \begin{cases} 1 & , \text{if } i = j \\ 0 & , \text{if } i \ne j \end{cases}$, note first that $\left( I_{n} - P \right) X = X - X \left( X^{T} X \right)^{-1} X^{T} X = 0$, so $\left( I_{n} - P \right) Y = \left( I_{n} - P \right) \left( X \beta + \varepsilon \right) = \left( I_{n} - P \right) \varepsilon$. Then the following holds:
$$
\begin{align*} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] =& E \left[ \left( Y - P Y \right)^{T} \left( Y - P Y \right) \right] \\ =& E \left[ \left[ \left( I_{n} - P \right) \varepsilon \right] ^{T} \left[ \left( I_{n} - P \right) \varepsilon \right] \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right)^{T} \left( I_{n} - P \right) \varepsilon \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right) \left( I_{n} - P \right) \varepsilon \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right)^{2} \varepsilon \right] \\ =& E \left[ \varepsilon^{T} \left( I_{n} - P \right) \varepsilon \right] \\ =& E \left[ \sum_{i,j} \varepsilon_{i} \varepsilon_{j} \left( \delta_{ij} - P_{ij} \right) \right] \\ =& \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \delta_{ij} \right] - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} P_{ij} \right] \\ =& \sum_{i} E \left[ \varepsilon_{i}^{2} \right] - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \\ =& n \sigma^{2} - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \end{align*}
$$
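
The substitution of $\varepsilon$ for $Y$ in the second line rests on $(I_{n}-P)X = 0$. The following numerical sketch (made-up data again) confirms that identity and, as a consequence, that $SSE = \varepsilon^{T}(I_{n}-P)\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary illustrative sizes and coefficients
n, p, sigma = 25, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([2.0, 1.0, -1.0])
eps = rng.normal(scale=sigma, size=n)
y = X @ beta + eps

P = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

print(np.allclose((I - P) @ X, 0))                 # (I - P) X = 0
print(np.allclose((I - P) @ y, (I - P) @ eps))     # hence (I - P) Y = (I - P) eps
sse = np.sum((y - P @ y) ** 2)
print(np.isclose(sse, eps @ (I - P) @ eps))        # SSE = eps^T (I - P) eps
```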

[2]

That the residuals are independent means that for $i \ne j$, $\varepsilon_{i}$ and $\varepsilon_{j}$ are uncorrelated, so $E \left[ \varepsilon_{i} \varepsilon_{j} \right] = 0$ when $i \ne j$; and when $i = j$, by the linearity and homoscedasticity of the residuals, $E \left[ \varepsilon_{i} \varepsilon_{j} \right] = \sigma^{2}$. Thus we obtain:
$$
\begin{align*} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] =& n \sigma^{2} - \sum_{i,j} E \left[ \varepsilon_{i} \varepsilon_{j} \right] P_{ij} \\ =& n \sigma^{2} - \sum_{i} \sigma^{2} P_{ii} \end{align*}
$$

Cyclic property of the trace:
$$
\text{Tr}(ABC) = \text{Tr}(BCA) = \text{Tr}(CAB)
$$
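
A quick numerical illustration of the cyclic property, using arbitrary random matrices of compatible shapes:

```python
import numpy as np

rng = np.random.default_rng(5)

# Random matrices with compatible shapes (3x4, 4x5, 5x3), chosen arbitrarily
A, B, C = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 3))

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
print(np.isclose(t1, t2) and np.isclose(t2, t3))   # True: Tr(ABC) = Tr(BCA) = Tr(CAB)
```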

Since $\sum_{i} P_{ii}$ is the trace of $P$, $\text{tr} P$, we have
$$
\begin{align*} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] =& n \sigma^{2} - \sigma^{2} \sum_{i} P_{ii} \\ =& \sigma^{2} \left( n - \text{tr} P \right) \\ =& \sigma^{2} \left( n - \text{tr} X \left( X^{T} X \right)^{-1} X^{T} \right) \\ =& \sigma^{2} \left( n - \text{tr} X^{T} X \left( X^{T} X \right)^{-1} \right) \\ =& \sigma^{2} \left( n - \text{tr} I_{1+p} \right) \\ =& \sigma^{2} \left( n - (1+p) \right) \end{align*}
$$
Dividing both sides by $(n-p-1)$ gives
$$
{{ 1 } \over { n-p-1 }} E \left[ \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \right] = \sigma^{2}
$$
hence $\widehat{\sigma^{2}} = \sum \left( y_{i} - \hat{y}_{i} \right)^{2} / (n-p-1)$ is an unbiased estimator of $\sigma^{2}$.
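
Numerically, the trace of the projection matrix indeed comes out to $1+p$ regardless of $n$; a small check with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Arbitrary sizes chosen for illustration
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
P = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.trace(P), 1 + p)                 # both equal p + 1 (up to floating point)
```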

[3]

Multivariate normality of the regression coefficient vector:
$$
\hat{\beta} \sim N_{1+p} \left( \beta , \sigma^{2} \left( X^{T} X \right)^{-1} \right)
$$

If the residuals follow a normal distribution iid, then the marginal probability distribution of the $k$th component $\hat{\beta}_{k}$ of $\hat{\beta} = \left( \hat{\beta}_{0} , \cdots , \hat{\beta}_{p} \right)$ also follows the following univariate normal distribution:
$$
\hat{\beta}_{k} \sim N \left( \beta_{k} , \sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk} \right)
$$
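
A Monte Carlo sketch (fixed made-up design, normal errors, 20,000 replications) shows the empirical covariance of $\hat{\beta}$ lining up with $\sigma^{2}(X^{T}X)^{-1}$, and in particular the variance of each $\hat{\beta}_{k}$ with $\sigma^{2}\left[ (X^{T}X)^{-1} \right]_{kk}$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Fixed design, normal errors; all numbers are made up for illustration
n, p, sigma = 30, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -1.5])
XtX_inv = np.linalg.inv(X.T @ X)

betas = []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    betas.append(XtX_inv @ X.T @ y)
betas = np.array(betas)

# Empirical covariance of beta_hat vs. the theoretical sigma^2 (X^T X)^{-1}
print(np.cov(betas.T))
print(sigma**2 * XtX_inv)
# In particular, Var(beta_hat_k) is approximately sigma^2 [(X^T X)^{-1}]_{kk} for each k
```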

General definition of the standard error: For some estimator $T$, the estimate of the standard deviation of $T$ is called the standard error.
$$
\text{s.e.} \left( T \right) := \sqrt{ \widehat{ \operatorname{Var} \left( T \right) } }
$$

Since $\operatorname{Var} \hat{\beta}_{k} = \sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk}$, we obtain the following.
$$
\begin{align*} \text{s.e.} \left( \hat{\beta}_{k} \right) =& \sqrt{ \widehat{ \operatorname{Var} \left( \hat{\beta}_{k} \right) } } \\ =& \sqrt{ \widehat{\sigma^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } } \\ =& \sqrt{ {{ 1 } \over { n-p-1 }} \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } \\ =& \hat{\sigma} \sqrt{ \left[ \left( X^{T} X \right)^{-1} \right]_{kk} } \end{align*}
$$