Design Matrix
Buildup
Let’s load the built-in data faithful
in R and check it with the head()
function.
Though only six, at a glance, eruptions
and waiting
seem to have a positive correlation. It would be nice if their relationship could be represented by some two constants $\beta_{0}, \beta_{1}$ such that
$$\text{(eruptions)} = \beta_{0} + \beta_{1} \cdot \text{(waiting) }$$
The above equation represents the linear relationship between the two variables as the equation of a line, where $\beta_{0}$ is the constant term, and $\beta_{1}$ represents the slope.
However, in reality, due to differences from theory, some error term $\varepsilon$ is necessary. For simplification, if we denote this as $y:=\text{(eruptions)}$ and $x:=\text{(waiting) }$, we get the following. $$y = \beta_{0} + \beta_{1} x + \varepsilon$$
The screenshot above shows a total of $6$ ordered pairs, representing them as a system of equations yields the following.
$$ \begin{cases} 3.600 = \beta_{0} + \beta_{1} 79 + \varepsilon_{1} \\ 1.800 = \beta_{0} + \beta_{1} 54 + \varepsilon_{2} \\ 3.333 = \beta_{0} + \beta_{1} 74 + \varepsilon_{3} \\ 2.283 = \beta_{0} + \beta_{1} 62 + \varepsilon_{4} \\ 4.533 = \beta_{0} + \beta_{1} 85 + \varepsilon_{5} \\ 2.883 = \beta_{0} + \beta_{1} 55 + \varepsilon_{6} \end{cases} $$
Actually, faithful
contains as many as 272 ordered pairs, thus representing all of them in this manner is impractical, let’s represent them using symbols again.
$$ \begin{cases} y_{1} &= \beta_{0} + \beta_{1} x_{1} + \varepsilon_{1} \\ y_{2} &= \beta_{0} + \beta_{1} x_{2} + \varepsilon_{2} \\ &\vdots& \\ y_{272} &= \beta_{0} + \beta_{1} x_{272} + \varepsilon_{272} \end{cases} $$
Meanwhile, such a system of equations can be more simply represented as a matrix equation. $$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{272} \end{bmatrix} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{272} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{272} \end{bmatrix} $$ As usual, if even the matrix is represented with uppercase, you finally get $Y = X \beta + \varepsilon$.
Definition
Here, the matrix grouping together the independent variables, as indicated by $X$, is called the Design Matrix.
Prerequisites
Thus, being able to represent data as a matrix means that various tools from linear algebra can be applied to statistics. Finding $\beta$ here is precisely what regression analysis is about, and understanding it accurately requires knowledge of linear algebra.
Many learners approaching statistics do not feel the need for linear algebra and neglect it, only to feel challenged when matrices are introduced. To avoid following in the footsteps of your predecessors, it’s crucial to thoroughly grind and polish lower-year subjects like analysis and linear algebra.