

Design Matrix

Buildup

Let’s load the built-in dataset faithful in R and take a look at it with the head() function.

[Screenshot: output of head(faithful), showing the first six rows of eruptions and waiting]
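For reference, the same output can be reproduced with the snippet below; the printed rows match the six ordered pairs listed further down in this post.

```r
# Load the built-in Old Faithful geyser data and show the first six rows
data(faithful)
head(faithful)
#   eruptions waiting
# 1     3.600      79
# 2     1.800      54
# 3     3.333      74
# 4     2.283      62
# 5     4.533      85
# 6     2.883      55
```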

Although these are only six rows, at a glance eruptions and waiting seem to have a positive correlation. It would be nice if their relationship could be captured by two constants $\beta_{0}, \beta_{1}$ such that
$$ \text{(eruptions)} = \beta_{0} + \beta_{1} \cdot \text{(waiting)} $$
The equation above expresses the linear relationship between the two variables as the equation of a line, where $\beta_{0}$ is the constant term and $\beta_{1}$ is the slope.
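This visual impression can be backed up with a quick check in R; this is a small aside, and the original post does not report these values.

```r
# Scatter plot of waiting time against eruption duration
plot(faithful$waiting, faithful$eruptions,
     xlab = "waiting", ylab = "eruptions")

# Sample correlation; a value near +1 confirms the positive association
cor(faithful$waiting, faithful$eruptions)
```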

In reality, however, the data deviate from this theoretical line, so some error term $\varepsilon$ is necessary. For brevity, writing $y := \text{(eruptions)}$ and $x := \text{(waiting)}$, we get the following.
$$ y = \beta_{0} + \beta_{1} x + \varepsilon $$

The screenshot above shows a total of 6 ordered pairs; representing them as a system of equations yields the following.

$$ \begin{cases} 3.600 = \beta_{0} + \beta_{1} \cdot 79 + \varepsilon_{1} \\ 1.800 = \beta_{0} + \beta_{1} \cdot 54 + \varepsilon_{2} \\ 3.333 = \beta_{0} + \beta_{1} \cdot 74 + \varepsilon_{3} \\ 2.283 = \beta_{0} + \beta_{1} \cdot 62 + \varepsilon_{4} \\ 4.533 = \beta_{0} + \beta_{1} \cdot 85 + \varepsilon_{5} \\ 2.883 = \beta_{0} + \beta_{1} \cdot 55 + \varepsilon_{6} \end{cases} $$

In fact, faithful contains as many as 272 ordered pairs, so writing all of them out this way is impractical. Let’s represent them with symbols again.

$$ \begin{cases} y_{1} = \beta_{0} + \beta_{1} x_{1} + \varepsilon_{1} \\ y_{2} = \beta_{0} + \beta_{1} x_{2} + \varepsilon_{2} \\ \qquad \vdots \\ y_{272} = \beta_{0} + \beta_{1} x_{272} + \varepsilon_{272} \end{cases} $$
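Incidentally, the figure of 272 observations quoted above is easy to verify directly in R.

```r
# Number of ordered pairs (rows) in the faithful data
nrow(faithful)
# [1] 272
```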

Meanwhile, such a system of equations can be represented much more simply as a matrix equation.
$$ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{272} \end{bmatrix} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{272} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{272} \end{bmatrix} $$
As usual, writing the matrices and vectors in uppercase, we finally get $Y = X \beta + \varepsilon$.
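As a concrete sketch (the object names here are my own), the vector and matrix appearing in this equation can be assembled explicitly in R.

```r
# Response vector y and the 272 x 2 matrix X with a leading column of ones
y <- faithful$eruptions
X <- cbind("(Intercept)" = 1, waiting = faithful$waiting)
dim(X)
# [1] 272   2
```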

Definition

Here, the matrix grouping together the independent variables, denoted by $X$ above, is called the Design Matrix.
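In R there is no need to build this matrix by hand: model.matrix() constructs the design matrix from a model formula, and it is what lm() relies on internally. A brief illustration:

```r
# Design matrix generated from the regression formula
X <- model.matrix(eruptions ~ waiting, data = faithful)
head(X)
#   (Intercept) waiting
# 1           1      79
# 2           1      54
# 3           1      74
# 4           1      62
# 5           1      85
# 6           1      55
```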

Prerequisites

Thus, being able to represent data as a matrix means that the various tools of linear algebra can be applied to statistics. Finding $\beta$ here is precisely what regression analysis is about, and understanding it accurately requires knowledge of linear algebra.
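For instance, $\beta$ can be estimated purely with matrix operations on the design matrix. The sketch below uses the standard least-squares solution, which this post does not derive; its result agrees with what R’s lm() reports.

```r
X <- model.matrix(eruptions ~ waiting, data = faithful)
y <- faithful$eruptions

# Least-squares estimate from the normal equations  (t(X) X) beta = t(X) y
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat

# The same coefficients via R's built-in linear model fit
coef(lm(eruptions ~ waiting, data = faithful))
```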

Many learners approaching statistics do not feel the need for linear algebra and neglect it, only to struggle once matrices are introduced. To avoid following in the footsteps of your predecessors, it is crucial to build a thorough, solid grounding in lower-year subjects such as analysis and linear algebra.