Regression Analysis Including Qualitative Variables

Overview

The independent variables in a regression analysis are not always quantitative. Categorical data often needs to be reflected in the analysis as well, such as a person's gender, the company they belong to, the color of an object, or whether a material is a metal.

Build-up 1

Imagine predicting the starting salary $Y$ of every job seeker this year from their nationwide exam score $X_{1}$, age $X_{2}$, gender $S$, and highest education level $E$. With multiple regression analysis we can find out how the score $X_{1}$ and age $X_{2}$ affect the starting salary $Y$, written as $Y \gets X_{1} + X_{2}$. The regression equation we get is
$$
Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \varepsilon
$$
and through hypothesis testing we should discard the variables that are not statistically significant. If we add gender $S$, the equation is modified to
$$
Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \varepsilon
$$
However, since "male" and "female" are not numbers, we cannot multiply them by a regression coefficient. This is where the idea of using an indicator function comes in.

In statistics, an indicator function is a function that takes the value $1$ or $0$ depending on whether the data belongs to the specified category or not. Defining the variable
$$
S = \begin{cases} 1 & , \text{female} \\ 0 & , \text{male} \end{cases}
$$
with the indicator function, the final regression equation becomes

  • For males ($S = 0$) $Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \varepsilon$
  • For females ($S = 1$) $Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} + \varepsilon$

In other words, $\beta_{3}$ is the difference in starting salary between a female and a male applicant with the same score and age.
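As a minimal sketch of the indicator coding described here, the following R snippet builds $S$ by hand and compares it with the dummy column R generates from a factor. The vector `gender` and its four values are made-up toy data for illustration, not part of the source.

```r
# Hypothetical toy data (not from the source): gender of four applicants
gender <- factor(c("male", "female", "female", "male"),
                 levels = c("male", "female"))  # first level = baseline (S = 0)

# Indicator variable written out by hand: 1 for female, 0 for male
S <- as.integer(gender == "female")

# R builds the same 0/1 column automatically from the factor
X <- model.matrix(~ gender)
print(cbind(S, X))
```

The column `genderfemale` of the design matrix should coincide with the hand-made `S`, which is why no manual coding is needed when the variable is stored as a factor.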

Furthermore, if educational background $E$ is added, the equation is modified to
$$
Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \beta_{4} E + \varepsilon
$$
Unlike gender, educational background can take several values, but let's consider just four cases:
$$
E = \begin{cases} 3 & , \text{PhD} \\ 2 & , \text{Master's degree} \\ 1 & , \text{Bachelor's degree} \\ 0 & , \text{High school graduate} \end{cases}
$$
A single indicator function cannot distinguish four cases, but the problem is easily resolved by splitting $E$ into three variables, with high school graduates as the baseline for which all three are $0$:
$$
E_{1} = \begin{cases} 1 & , \text{Bachelor's degree} \\ 0 & , \text{otherwise} \end{cases}
\\
E_{2} = \begin{cases} 1 & , \text{Master's degree} \\ 0 & , \text{otherwise} \end{cases}
\\
E_{3} = \begin{cases} 1 & , \text{PhD} \\ 0 & , \text{otherwise} \end{cases}
$$
In the end, the regression equation expands to
$$
Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \gamma_{1} E_{1} + \gamma_{2} E_{2} + \gamma_{3} E_{3} + \varepsilon
$$
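The split into three indicator variables can be sketched in R as follows. The level names and the one-row-per-level data are hypothetical, chosen only to mirror the four cases of $E$; the columns are renamed `E1`, `E2`, `E3` to match the notation of the text.

```r
# Hypothetical toy data (not from the source): one applicant per education level
edu_levels <- c("high school", "bachelor", "master", "phd")
E <- factor(edu_levels, levels = edu_levels)  # "high school" is the baseline

# Drop the intercept column; the remaining three 0/1 columns
# play the role of E_1, E_2, E_3
D <- model.matrix(~ E)[, -1]
colnames(D) <- c("E1", "E2", "E3")
rownames(D) <- edu_levels
print(D)
```

Four categories need only three columns because the baseline category is identified by all three being zero.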

Practice

In practice, performing regression analysis with qualitative variables in R does not require such manual manipulation. Let's load the built-in data ToothGrowth.

ToothGrowth records the tooth length (len) of guinea pigs that were fed different doses (dose) of either Vitamin C or orange juice (supp). The analysis is carried out in the same way as an ordinary multiple regression: when a categorical variable is included, R handles it automatically.

The results suggest that a larger dose increases tooth length, and that administering Vitamin C instead of orange juice actually reduces it.
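To see which category the Vitamin C effect is measured against, one can inspect how R coded `supp`. The sketch below uses only base R and the built-in `ToothGrowth` data; `out1` is the model name used in the Code section.

```r
# Fit the same model as in the text: len on all remaining variables
out1 <- lm(len ~ ., data = ToothGrowth)

levels(ToothGrowth$supp)     # the first level is treated as the baseline
contrasts(ToothGrowth$supp)  # 0/1 coding R uses for the remaining level
coef(out1)                   # the coefficient of the dummy is named "suppVC"
```

Since orange juice (`OJ`) is the baseline level, the `suppVC` coefficient plays the role of $\beta_{3}$ from the Build-up: it is the shift in tooth length for Vitamin C relative to orange juice at the same dose.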

Logically, guinea pigs that consistently consume orange juice should have a better nutritional balance, particularly for easily regulated minerals such as calcium. The doses in the experiment were not excessive either, so it makes sense that more is better. The explanatory power is also decent, with an adjusted coefficient of determination of $R^{2}_{a} = 0.6934$, and overall there seem to be no serious problems.

But there’s something fishy when we look at the residual plot.

[Figure: residual plot of out1]

Looking at the data, Vitamin C was administered in experiments 1 through 30 and orange juice from experiment 31 onward. At a glance, the residuals are not scattered at random: how much the dose helps tooth length appears to depend on which supplement was given. Such a pattern can of course arise by chance in the real world, but if we wrote everything off as coincidence there would be no reason to study statistics in the first place. It is uncertain whether such problems can be eliminated completely, but scholars have named this phenomenon interaction and have been researching ways to deal with it.
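One way to check the pattern numerically rather than by eye is to average the residuals within each combination of supplement and dose. This is only a sketch using base R and the built-in data; the name `res_means` is introduced here for illustration.

```r
out1 <- lm(len ~ ., data = ToothGrowth)

# Rows 1-30 are the Vitamin C group and rows 31-60 the orange juice group
table(ToothGrowth$supp[1:30]); table(ToothGrowth$supp[31:60])

# Mean studentized residual in each supplement-by-dose cell
res_means <- tapply(rstudent(out1),
                    list(supp = ToothGrowth$supp, dose = ToothGrowth$dose),
                    mean)
print(round(res_means, 2))
```

If the additive model were adequate, these cell means would all sit near zero. Cell means that change with the dose differently for the two supplements are a sign of the interaction mentioned in the text.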

Code

Below is the example code used above.

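# Inspect the built-in data set
head(ToothGrowth); tail(ToothGrowth); str(ToothGrowth)
# Regress len on all other variables (supp is a factor, dose is numeric)
out1 <- lm(len ~ ., data = ToothGrowth); summary(out1)

# dev.new() replaces the Windows-only win.graph(); rstudent() gives studentized residuals
dev.new(width = 5, height = 5)
plot(rstudent(out1), main = "Standardized Residual Plot 1"); abline(h = 0)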

  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p121-125.