
Proof of Normality of Regression Coefficients

Theorem

$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$ Given $p$ independent variables and $n$ data points, the linear multiple regression model can be written with a design matrix as above, which we abbreviate as $Y = X \beta + \varepsilon$. Assume the residuals satisfy homoscedasticity, independence, and normality, that is, $$ \varepsilon_{1} , \cdots , \varepsilon_{n} \overset{\text{iid}}{\sim} N \left( 0 , \sigma^{2} \right) \iff \varepsilon \sim N_{n} \left( \mathbf{0} , \sigma^{2} I_{n} \right) $$ Under this assumption, the vector of estimated regression coefficients $$ \hat{\beta} = \left( \hat{\beta}_{0}, \hat{\beta}_{1}, \cdots, \hat{\beta}_{p} \right) = \left( X^{T} X \right)^{-1} X^{T} Y $$ follows a multivariate normal distribution: $$ \hat{\beta} \sim N_{1+p} \left( \beta , \sigma^{2} \left( X^{T} X \right)^{-1} \right) $$ Moreover, no linear unbiased estimator of $\beta$ has a smaller variance than $\hat{\beta}$, so it is called the Best Linear Unbiased Estimator (BLUE).


Description

The fact that the vector of regression coefficients follows a multivariate normal distribution is especially important for hypothesis testing on the regression coefficients; since the result rests on the residual assumptions, it is also why the homoscedasticity, independence, and normality of the residuals must be diagnosed.
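
As a quick numerical illustration (a sketch added here, not part of the original argument), one can simulate the model and compute $\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} Y$ directly; NumPy and all variable names below are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 2                       # n data points, p independent variables (example values)
sigma = 0.5                         # residual standard deviation sigma (assumed)
beta = np.array([1.0, 2.0, -3.0])   # true coefficients (beta_0, beta_1, beta_2)

# Design matrix X: a column of ones for the intercept plus p covariate columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
eps = rng.normal(0.0, sigma, size=n)   # iid N(0, sigma^2) residuals
Y = X @ beta + eps                     # Y = X beta + eps

# beta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve rather than an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # close to the true beta
```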

Proof

Strategy: There is not much of a strategy per se, since everything follows easily from the normality of the residuals. Once one notes that the design matrix $X$ is not a random variable but data fixed in matrix form, i.e., a matrix of constants, the rest is just matrix computation.

Normality 1

$$ \begin{align*} \hat{\beta} =& \left( X^{T} X \right)^{-1} X^{T} Y \\ =& \left( X^{T} X \right)^{-1} X^{T} \left( X \beta + \varepsilon \right) \\ =& I_{1+p} \beta + \left( X^{T} X \right)^{-1} X^{T} \varepsilon \end{align*} $$ Thus $\hat{\beta}$ is an affine transformation of $\varepsilon$, and since $\varepsilon$ is assumed to follow a multivariate normal distribution, $\hat{\beta}$ also follows a multivariate normal distribution.
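
This decomposition can be checked numerically: the least squares solution $\left( X^{T} X \right)^{-1} X^{T} Y$ and the affine expression $\beta + \left( X^{T} X \right)^{-1} X^{T} \varepsilon$ coincide up to floating-point error. A minimal sketch, with all data simulated purely for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design matrix
beta = np.array([1.0, 2.0, -3.0])                           # true coefficients
eps = rng.normal(0.0, 0.5, size=n)                          # N(0, sigma^2) residuals
Y = X @ beta + eps

lhs = np.linalg.solve(X.T @ X, X.T @ Y)            # (X^T X)^{-1} X^T Y
rhs = beta + np.linalg.solve(X.T @ X, X.T @ eps)   # I beta + (X^T X)^{-1} X^T eps
print(np.allclose(lhs, rhs))                       # True
```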

Mean

$$ \begin{align*} E \hat{\beta} =& E \left[ I_{1+p} \beta + \left( X^{T} X \right)^{-1} X^{T} \varepsilon \right] \\ =& E I_{1+p} \beta + E \left[ \left( X^{T} X \right)^{-1} X^{T} \varepsilon \right] \\ =& E I_{1+p} \beta + \left( X^{T} X \right)^{-1} X^{T} E \varepsilon \\ =& E \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \mathbf{0} \\ =& \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} \end{align*} $$ Consequently, $\hat{\beta}$ is also an unbiased estimator of $\beta$.
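
The unbiasedness can also be seen by Monte Carlo: holding $X$ fixed and redrawing only $\varepsilon$, the average of the estimates $\hat{\beta}$ approaches $\beta$. A sketch under assumed example values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 100, 0.5
beta = np.array([1.0, 2.0, -3.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design matrix

# Only eps (and therefore Y) is random; X and beta stay fixed across replications.
betas = []
for _ in range(5000):
    eps = rng.normal(0.0, sigma, size=n)
    Y = X @ beta + eps
    betas.append(np.linalg.solve(X.T @ X, X.T @ Y))

print(np.mean(betas, axis=0))   # approximately beta, illustrating E beta_hat = beta
```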

Variance 2 3

$$ \begin{align*} \operatorname{Var} \hat{\beta} =& \operatorname{Var} \left[ I_{1+p} \beta + \left( X^{T} X \right)^{-1} X^{T} \varepsilon \right] \\ =& \operatorname{Var} \left[ \left( X^{T} X \right)^{-1} X^{T} \varepsilon \right] \\ =& \left( X^{T} X \right)^{-1} X^{T} \left( \operatorname{Var} \varepsilon \right) \left( \left( X^{T} X \right)^{-1} X^{T} \right)^{T} \\ =& \left( X^{T} X \right)^{-1} X^{T} \sigma^{2} I_{n} X \left( X^{T} X \right)^{-1} \\ =& \sigma^{2} \left( X^{T} X \right)^{-1} X^{T} X \left( X^{T} X \right)^{-1} \\ =& \sigma^{2} \left( X^{T} X \right)^{-1} \end{align*} $$ Meanwhile, since $\hat{\beta}$ is the least squares estimator, no linear unbiased estimator of $\beta$ has a smaller variance (the Gauss–Markov theorem), and thus $\hat{\beta}$ is the best linear unbiased estimator.
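
Likewise, the covariance formula can be checked empirically: the sample covariance of simulated $\hat{\beta}$'s should be close to $\sigma^{2} \left( X^{T} X \right)^{-1}$. A minimal sketch with assumed example values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 0.5
beta = np.array([1.0, 2.0, -3.0])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design matrix

# Estimate beta repeatedly with freshly drawn residuals.
betas = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(0.0, sigma, size=n)))
    for _ in range(20000)
])

empirical = np.cov(betas, rowvar=False)          # sample covariance of beta_hat
theoretical = sigma**2 * np.linalg.inv(X.T @ X)  # sigma^2 (X^T X)^{-1}
print(np.round(empirical, 4))
print(np.round(theoretical, 4))                  # the two matrices nearly coincide
```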