Simple Regression Analysis
Overview
Regression Analysis is a method for identifying relationships between variables, and it is especially useful for elucidating linear relationships. Simple Linear Regression is the simplest case, referring to regression analysis with one dependent (response) variable and one independent (explanatory) variable.
Model 1
Saying that the independent variable $x_{i}$ and the dependent variable $y_{i}$ have a linear relationship means that, for some $a,b$, they can be expressed as $y_{i} = ax_{i} + b$. Of course, actual data will inevitably deviate from this, so more precisely the model includes an error term, becoming $y_{i} = ax_{i} + b + \varepsilon_{i}$. Rewriting this in the form more commonly used in regression analysis, $$ y_{i} = \beta_{0} + \beta_{1} x_{i} + \varepsilon_{i} $$ Representing it with a design matrix, $$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{bmatrix} $$ and collecting terms, we get $Y = X \beta + \varepsilon$.
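As a concrete illustration, here is a minimal sketch (not from the source) of how the design matrix $X$ above might be built, assuming NumPy and a small hypothetical data set `x`, `y`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical response variable

# X has a column of ones (for the intercept beta_0) and a column of x values
# (for the slope beta_1), matching Y = X beta + epsilon above.
X = np.column_stack([np.ones_like(x), x])
print(X)
```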
Optimization
This boils down to finding the $\beta = \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix}$ that minimizes $\| \varepsilon \|_{2} = \| Y - X \beta \|_{2}$ by the method of least squares. Since $\beta$ holds the intercept and slope of the line, finding $\beta$ means finding the line that explains the data linearly with the least error. Of course, since we cannot know the exact relationship between the two variables, what we actually obtain are the estimates of the regression coefficients, $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$. In simple terms, draw the line that looks the most like the data.
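A minimal sketch, assuming the hypothetical `x`, `y`, and `X` from the previous snippet, of solving this least-squares problem numerically:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

# np.linalg.lstsq returns the beta that minimizes ||y - X beta||_2.
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # [beta_0_hat, beta_1_hat]: estimated intercept and slope
```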
These problems are typically solved using tools from numerical linear algebra, but simple regression analysis can also be solved with elementary calculus. Writing the matrix equation out componentwise, the problem is to find the $\beta_{0} = \hat{\beta}_{0}$ and $\beta_{1} = \hat{\beta}_{1}$ that minimize $$ \begin{align} \| \varepsilon \|_{2}^{2} = \sum_{i=1}^{n} ( y_{i} - \beta_{0} - \beta_{1} x_{i} )^2 \end{align} $$ Taking the partial derivative of equation $(1)$ with respect to $\beta_{0}$, $$ {{ \partial \| \varepsilon \|_{2}^{2} } \over { \partial \beta_{0}}} = -2 \sum_{i=1}^{n} (y_{i} - \beta_{0} - \beta_{1} x_{i} ) $$ Setting this derivative to $0$ gives $$ n \beta_{0} = \sum_{i=1}^{n} y_{i} - \beta_{1} \sum_{i=1}^{n} x_{i} $$ Thus, $\| \varepsilon \|_{2}^{2}$ is minimized when $\beta_{0} = \overline{y} - \beta_{1} \overline{x}$. Taking the partial derivative of equation $(1)$ with respect to $\beta_{1}$, $$ {{ \partial \| \varepsilon \|_{2}^{2} } \over { \partial \beta_{1}}} = -2 \sum_{i=1}^{n} x_{i} (y_{i} - \beta_{0} - \beta_{1} x_{i} ) $$ Setting this to $0$ as well and substituting $\beta_{0} = \overline{y} - \beta_{1} \overline{x}$, $$ \sum_{i=1}^{n} x_{i} (y_{i} - \overline{y} + \beta_{1} \overline{x} - \beta_{1} x_{i} ) = 0 $$ In other words, $$ \beta_{1} \sum_{i=1}^{n} ( x_{i}^2 - \overline{x} x_{i} ) = \sum_{i=1}^{n} x_{i} y_{i} - \sum_{i=1}^{n} x_{i} \overline{y} $$ Rearranging, $$ \begin{align*} \beta_{1} =& {{\sum_{i=1}^{n} x_{i} y_{i} - \sum_{i=1}^{n} x_{i} \overline{y} } \over {\sum_{i=1}^{n} ( x_{i}^2 - \overline{x} x_{i} ) }} \\ =& {{ \sum_{i=1}^n ( x_{i} - \overline{x} ) ( y_{i} - \overline{y} ) } \over { \sum_{i=1}^{n} (x_{i}^2 - \overline{x}^2 )}} \\ =& {{ \operatorname{Cov} (X,Y) } \over { \operatorname{Var} ( X ) }} \\ =& \operatorname{Cor} (X,Y) {{s_{y}} \over {s_{x}}} \end{align*} $$ For the actual calculation, one should find $\hat{\beta}_{1}$ first and then obtain $\hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1} \overline{x}$.
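A minimal sketch of the closed-form estimates derived here, using the same hypothetical data as above: the slope $\hat{\beta}_{1}$ is computed first as covariance over variance, then the intercept from $\hat{\beta}_{0} = \overline{y} - \hat{\beta}_{1} \overline{x}$.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()

# Slope first: sample covariance of (x, y) over sample variance of x.
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# Then the intercept: beta_0_hat = ybar - beta_1_hat * xbar.
beta0_hat = ybar - beta1_hat * xbar

print(beta0_hat, beta1_hat)  # should agree with np.linalg.lstsq above
```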
See Also
- Simple Regression Analysis Results in R
- T-test of Regression Coefficients
- Multiple Regression Analysis: unlike simple regression analysis, the model is extended to multiple independent variables.
- Derivation of the Estimate of Multiple Regression Coefficients Vector
Hadi. (2006). Regression Analysis by Example (4th Edition): p24. ↩︎