

The Definition of Regression Coefficients and Derivation of Estimator Formulas

Definition 1

$$ Y = \beta_{0} + \beta_{1} X_{1} + \cdots + \beta_{p} X_{p} + \varepsilon $$ In multiple regression analysis, when the above linear model is posed for given $p$ independent variables $X_{1} , \cdots , X_{p}$, the constants $\beta_{0} , \beta_{1} , \cdots , \beta_{p}$ are called the regression coefficients. $Y$ is the dependent variable, and $\varepsilon$ is the randomly distributed error term.

Formula

$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$ Given $n$ data points with $p < n$, the linear multiple regression model can be written with the design matrix as above, denoted briefly as $Y = X \beta + \varepsilon$. The least squares estimator vector $\hat{\beta}$ of $\beta$ is as follows. $$ \hat{\beta} = \begin{bmatrix} \hat{\beta}_{0} \\ \hat{\beta}_{1} \\ \vdots \\ \hat{\beta}_{p} \end{bmatrix} = \left( X^{T} X \right)^{-1} X^{T} Y $$ Moreover, since $\hat{\beta}$ has the smallest variance among all linear unbiased estimators of $\beta$, it is known as the Best Linear Unbiased Estimator (BLUE).
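For concreteness, the closed form above can be checked numerically. The following is a minimal numpy sketch on simulated data; the sample size, true coefficients, and noise scale are arbitrary choices for illustration.

```python
import numpy as np

# Minimal sketch of the closed-form estimator on simulated data;
# the design, true coefficients, and noise scale are arbitrary.
rng = np.random.default_rng(0)
n, p = 50, 3                                                 # n observations, p predictors, p < n
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=n)            # Y = X beta + epsilon

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y                  # (X^T X)^{-1} X^T Y
print(beta_hat)                                              # close to beta_true
```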

Derivation 2 3

Our goal is to minimize $$ \left\| \varepsilon \right\|_{2}^{2} = \sum_{k=1}^{n} \varepsilon_{k}^{2} = \begin{bmatrix} \varepsilon_{1} & \varepsilon_{2} & \cdots & \varepsilon_{n} \end{bmatrix} \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} = \varepsilon^{T} \varepsilon $$ Since $\varepsilon = Y - X \beta$, this amounts to finding the $\beta$ that minimizes $\varepsilon^{T} \varepsilon = \left( Y - X \beta \right)^{T} \left( Y - X \beta \right)$.

Gradient of the residual sum of squares: $$ f \left( \mathbf{s} \right) := \left( \mathbf{y} - X \mathbf{s} \right)^{T} R \left( \mathbf{y} - X \mathbf{s} \right) $$ If $R$ is the identity matrix, then the following is obtained. $$ {{ \partial f \left( \mathbf{s} \right) } \over { \partial \mathbf{s} }} = - 2 X^{T} \left( \mathbf{y} - X \mathbf{s} \right) $$
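As a sanity check on this gradient formula (not part of the original derivation), one can compare it against central finite differences; every name and all data below are illustrative assumptions.

```python
import numpy as np

# Sketch: compare the analytic gradient -2 X^T (y - X s) with central finite
# differences of f(s) = (y - X s)^T (y - X s); all data here are simulated.
rng = np.random.default_rng(1)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)
s = rng.normal(size=p + 1)

f = lambda s: (y - X @ s) @ (y - X @ s)            # residual sum of squares, R = I
analytic = -2 * X.T @ (y - X @ s)

h = 1e-6
numeric = np.array([(f(s + h * e) - f(s - h * e)) / (2 * h) for e in np.eye(p + 1)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```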

Differentiating both sides with respect to $\beta$ $$ \begin{align*} {{ \partial } \over { \partial \beta }} \varepsilon^{T} \varepsilon =& - 2 X^{T} \left( Y - X \beta \right) \\ = & - 2 X^{T} Y + 2 X^{T} X \beta \end{align*} $$ and setting the result equal to the zero vector $\mathbf{0}$ yields $\hat{\beta}$ as follows. $$ \hat{\beta} = \argmin_{\beta} \varepsilon^{T} \varepsilon = \left( X^{T} X \right)^{-1} X^{T} Y $$ Meanwhile, it is easy to show that $\hat{\beta}$ is an unbiased estimator of $\beta$, and by the Gauss-Markov theorem no linear unbiased estimator of $\beta$ has smaller variance, making it the best linear unbiased estimator.
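The unbiasedness claim can be illustrated with a small Monte Carlo experiment; the fixed design, true coefficients, and replication count below are arbitrary assumptions made only for this sketch.

```python
import numpy as np

# Sketch: average beta_hat over many simulated error draws and compare with beta;
# the design, coefficients, and replication count are arbitrary.
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])

estimates = [
    np.linalg.inv(X.T @ X) @ X.T @ (X @ beta + rng.normal(size=n))  # fresh errors each replication
    for _ in range(5000)
]
print(np.mean(estimates, axis=0))   # approximately [1.0, -2.0, 0.5]
```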


If differentiating with respect to $\beta$ in the derivation is not appealing, there is an alternative approach using matrix algebra. By the least squares method in matrix algebra, the least squares solution $\hat{\beta}$ satisfies $$ X^{\ast} Y = X^{\ast} X \hat{\beta} $$ and since $X \in \mathbb{R}^{n \times (p+1)}$ is a real matrix, $X^{\ast} = X^{T}$, so $\hat{\beta} = \left( X^{T} X \right)^{-1} X^{T} Y$ is obtained again.
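In numerical practice the normal equations $X^{T} X \hat{\beta} = X^{T} Y$ are usually solved directly rather than by forming the explicit inverse. The following sketch assumes simulated data and illustrative names.

```python
import numpy as np

# Sketch: solve the normal equations X^T X beta_hat = X^T Y without forming
# the explicit inverse, and compare with numpy's built-in least squares solver.
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # library least squares
print(np.allclose(beta_normal, beta_lstsq))           # True
```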

Corollary

If $\hat{\beta}$ is the best linear unbiased estimator, then the sum of the observations $y_{k}$ and the sum of the fitted values $\hat{y}_{k} = \hat{\beta}_{0} + \sum_{j=1}^{p} \hat{\beta}_{j} x_{jk}$ are equal: $$ \sum_{k=1}^{n} y_{k} = \sum_{k=1}^{n} \hat{y}_{k} $$

Proof

In proving this formula, that $\hat{\beta}$ is the best linear unbiased estimator means the following holds. $$ \begin{align*} & \mathbf{0} = - 2 X^{T} Y + 2 X^{T} X \hat{\beta} \\ \implies & \mathbf{0} = X^{T} \left( Y - X \hat{\beta} \right) \\ \implies & \mathbf{0} = X^{T} \begin{bmatrix} y_{1} - \hat{y}_{1} \\ \vdots \\ y_{n} - \hat{y}_{n} \end{bmatrix} \end{align*} $$ Since $X$ is the design matrix, the first row of $X^{T}$ is a row vector whose elements are all $1$. Taking only the product of this first row of $X^{T}$ with $Y - X \hat{\beta}$ yields the following. $$ \begin{align*} & 0 = \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix} \begin{bmatrix} y_{1} - \hat{y}_{1} \\ \vdots \\ y_{n} - \hat{y}_{n} \end{bmatrix} \\ \implies & 0 = \left( y_{1} - \hat{y}_{1} \right) + \cdots + \left( y_{n} - \hat{y}_{n} \right) \\ \implies & 0 = \sum_{k=1}^{n} y_{k} - \sum_{k=1}^{n} \hat{y}_{k} \end{align*} $$ Consequently, the following is obtained. $$ \sum_{k=1}^{n} y_{k} = \sum_{k=1}^{n} \hat{y}_{k} $$
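The corollary is also easy to confirm numerically whenever the design matrix contains a column of ones; the data below are simulated and the names are illustrative.

```python
import numpy as np

# Sketch: with an intercept column in X, the sum of fitted values equals the
# sum of observations (equivalently, the residuals sum to zero); data simulated.
rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
print(np.isclose(y.sum(), y_hat.sum()))   # True
```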

See Also


  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p. 53.

  2. Hadi. (2006). Regression Analysis by Example (4th Edition): pp. 82-84.

  3. https://www.stat.purdue.edu/~boli/stat512/lectures/topic3.pdf