Regression Analysis Including Qualitative Variables

Overview

The independent variables in a regression analysis are not always quantitative. Categorical data often needs to be reflected in the analysis as well, such as a person's gender, the company they belong to, the color of an object, or whether a material is a metal.

Build-up [1]

Imagine predicting the starting salary $Y$ of this year's job seekers from their nationwide exam score $X_{1}$, age $X_{2}$, gender $S$, and highest education level $E$. With multiple regression analysis, we can see how score $X_{1}$ and age $X_{2}$ affect the starting salary $Y$, as in $Y \gets X_{1} + X_{2}$. The regression equation is $$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \varepsilon $$ and hypothesis testing tells us which variables are not statistically significant and should be dropped. If we add gender $S$, the equation becomes $$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \varepsilon $$ However, male and female are not numbers, so a regression coefficient cannot be multiplied by them. This is where the idea of using an indicator function comes in.

In statistics, an indicator function is a function that takes the value $1$ or $0$ depending on whether the data belongs to the specified category or not. By defining a variable such as $S = \begin{cases} 1 & ,\text{female} \\ 0 & ,\text{male} \end{cases}$ with an indicator function, the final regression equation becomes

  • For females ($S = 1$) $$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} + \varepsilon $$
  • For males ($S = 0$) $$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \varepsilon $$

In other words, $\beta_{3}$ is the difference in expected salary between a female and a male applicant with the same score and age.
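The role of the indicator variable can be checked with a small simulation. The sketch below is only an illustration: the data, the seed, the true coefficients, and the names x1, x2, sex, and s are all made up and do not come from the text. It fits the model once with a hand-made 0/1 column and once with a factor, which R should code the same way when "male" is the first level.

```r
# Simulated data: every value here is hypothetical, not from the text
set.seed(1)
n   <- 200
x1  <- rnorm(n, mean = 70, sd = 10)       # exam score
x2  <- round(runif(n, 22, 35))            # age
sex <- factor(sample(c("male", "female"), n, replace = TRUE),
              levels = c("male", "female"))
s   <- as.integer(sex == "female")        # indicator: 1 = female, 0 = male

# Assumed true model, with beta3 = 2 as the shift when s = 1
y <- 10 + 0.3 * x1 + 0.5 * x2 + 2 * s + rnorm(n)

fit_indicator <- lm(y ~ x1 + x2 + s)      # hand-made 0/1 column
fit_factor    <- lm(y ~ x1 + x2 + sex)    # R dummy-codes the factor itself

coef(fit_indicator)
coef(fit_factor)
```

The two fits should report the same numbers; only the name of the last coefficient differs (s versus sexfemale).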

Furthermore, if educational background $E$ is added, the equation becomes $$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \beta_{4} E + \varepsilon $$ Unlike gender, educational background can take several values, but let's consider just four cases, $E = \begin{cases} 3 & ,\text{doctorate} \\ 2 & ,\text{master's} \\ 1 & ,\text{bachelor's} \\ 0 & ,\text{high school} \end{cases}$. A single indicator function cannot cover four cases, and plugging the codes $0, 1, 2, 3$ in directly would force every step up in education to change the salary by the same amount $\beta_{4}$. This is easily resolved by splitting $E$ into three indicator variables, with high school as the baseline: $$E_{1} = \begin{cases} 1 & ,\text{bachelor's} \\ 0 & ,\text{otherwise} \end{cases} \\ E_{2} = \begin{cases} 1 & ,\text{master's} \\ 0 & ,\text{otherwise} \end{cases} \\ E_{3} = \begin{cases} 1 & ,\text{doctorate} \\ 0 & ,\text{otherwise} \end{cases}$$ A high school graduate has $E_{1} = E_{2} = E_{3} = 0$. In the end, the regression equation expands to $$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \gamma_{1} E_{1} + \gamma_{2} E_{2} + \gamma_{3} E_{3} + \varepsilon $$
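To see what splitting a four-level variable into three indicators looks like in practice, here is a minimal sketch using R's model.matrix. The six applicants and the level labels are hypothetical; the only assumption about R is its default treatment coding, where the first factor level serves as the baseline and gets no column of its own.

```r
# Hypothetical education levels for six applicants
edu <- factor(c("high school", "bachelor", "master", "doctorate",
                "bachelor", "high school"),
              levels = c("high school", "bachelor", "master", "doctorate"))

# Default treatment coding: one 0/1 column per non-baseline level,
# plus the intercept column
dummies <- model.matrix(~ edu)
dummies
```

Each row has at most one 1 among the three education columns, and the rows for high school graduates are all zeros there, which matches the roles of $E_{1}$, $E_{2}$, and $E_{3}$.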

Practice

In reality, performing regression analysis with qualitative variables in R does not require such complex manipulation. Let’s load the built-in data ToothGrowth.

(Figure: 20180907\_165557.png)

ToothGrowth records the tooth length (len) of guinea pigs that were given different doses (dose) of vitamin C through one of two supplements (supp): ascorbic acid (VC) or orange juice (OJ). Just as in ordinary multiple regression analysis, R handles the categorical variable automatically when it is included in the model.

(Figure: 20180907\_165610.png)

The results suggest that a higher dose goes with longer teeth, and that administering Vitamin C (VC) instead of orange juice actually goes with shorter teeth.

Intuitively, guinea pigs that consistently consume orange juice should have a better overall nutritional balance, including minerals such as calcium. The doses given in the experiment were also not excessive, so it makes sense that more is better. The explanatory power is not bad either, at $R^{2}_{a} = 0.6934$, and overall there seem to be no significant issues.
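The quantities discussed above can be pulled out of the fitted model directly. The sketch below assumes the same additive model as in this post, written out as len ~ supp + dose; the object names fit and adj_r2 are chosen here for illustration.

```r
# Additive model on the built-in data (same model as len ~ .)
fit <- lm(len ~ supp + dose, data = ToothGrowth)

# The first level of supp is the reference category for the dummy variable
levels(ToothGrowth$supp)

# The coefficient named suppVC compares VC against the reference level
coef(fit)

# Adjusted R-squared, the value quoted in the text
adj_r2 <- summary(fit)$adj.r.squared
round(adj_r2, 4)
```

Checking levels() first is a cheap way to confirm which category a dummy coefficient is measured against before interpreting its sign.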

But there’s something fishy when we look at the residual plot.

(Figure: 1.png)

Looking at the data, Vitamin C (VC) was administered in the first 30 observations, and orange juice from the 31st onward. At a glance, the residuals do not scatter randomly around zero: they seem to move with the particular combination of supplement and dose, as if the effect of the dose depended on which supplement was given. Of course, such coincidences can happen in the real world, but then there would be no reason to study statistics in the first place. Whether these issues can be eliminated with effort is uncertain, but scholars have named this phenomenon interaction and have been researching ways to deal with it.
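One rough way to put numbers on what the residual plot shows is to average the residuals within each supplement and dose combination. This is only a sketch under the assumption that the additive model len ~ supp + dose is the one being examined; the names fit, res, and cell_means are illustrative.

```r
fit <- lm(len ~ supp + dose, data = ToothGrowth)
res <- rstudent(fit)                      # studentized residuals

# Check the row ordering of the built-in data
table(ToothGrowth$supp[1:30])
table(ToothGrowth$supp[31:60])

# Mean studentized residual in each supplement/dose cell
cell_means <- tapply(res, list(ToothGrowth$supp, ToothGrowth$dose), mean)
round(cell_means, 2)
```

If the additive model described the data well, every cell mean would sit close to zero; cells that are clearly positive or negative hint that the effect of dose is not the same for both supplements.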

Code

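Below is the example code.

# Inspect the built-in data: first rows, last rows, and structure
head(ToothGrowth); tail(ToothGrowth); str(ToothGrowth)
# Regress len on all other columns; the factor supp is dummy-coded automatically
out1 <- lm(len ~ ., data = ToothGrowth); summary(out1)

# Studentized residuals in observation order (dev.new replaces the Windows-only win.graph)
dev.new(width = 5, height = 5); plot(rstudent(out1), main = "Studentized Residual Plot 1"); abline(h = 0)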

  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p121~125.