
Influence of Interaction in Regression Analysis

Buildup

It is recommended to read about regression analysis including qualitative variables first.

Imagine predicting this year's graduates' starting salaries from their college entrance exam scores $X_{1}$, age $X_{2}$, gender $S$, and final educational attainment $E$. Because gender and educational attainment are qualitative variables, first define gender as

$$ S = \begin{cases} 1 & ,\text{female} \\ 0 & ,\text{male} \end{cases} $$

and educational attainment as

$$ E_{1} = \begin{cases} 1 & ,\text{bachelor's degree} \\ 0 & ,\text{high school graduate} \end{cases} \\ E_{2} = \begin{cases} 1 & ,\text{master's degree} \\ 0 & ,\text{high school graduate} \end{cases} \\ E_{3} = \begin{cases} 1 & ,\text{PhD} \\ 0 & ,\text{high school graduate} \end{cases} $$

then, the regression equation would be represented as

$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \gamma_{1} E_{1} + \gamma_{2} E_{2} + \gamma_{3} E_{3} + \varepsilon $$
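As a small illustration of how dummy variables like $S$ and $E_{1}, E_{2}, E_{3}$ enter the model, the sketch below builds a design matrix in R. The data frame `grads` and its values are hypothetical, made up only to show the coding; `model.matrix()` is base R and uses treatment (0/1) coding for unordered factors by default.

```r
# Hypothetical toy data: four graduates, one per education level.
grads <- data.frame(
  sex = factor(c("male", "female", "female", "male"),
               levels = c("male", "female")),   # baseline: male (S = 0)
  edu = factor(c("highschool", "bachelor", "master", "phd"),
               levels = c("highschool", "bachelor", "master", "phd"))  # baseline: high school
)

# Treatment (dummy) coding: one 0/1 column per non-baseline level.
X <- model.matrix(~ sex + edu, data = grads)
print(X)
```

The first level of each factor acts as the baseline, so a male high school graduate has zeros in every dummy column, which matches the definitions of $S$ and $E_{i}$ above.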

The reasoning behind this model is that a higher college entrance exam score indicates intelligence or diligence, older age suggests more experience, men are more often employed in high-intensity physical labor, and higher educational attainment leads to jobs requiring specialization, all of which would contribute to a higher starting salary.

However, such reasoning rests on the premise that $X_{1} , X_{2} , S , E$ are independent of each other. One visible problem is the correlation between age and final educational attainment. Except for some prodigies, it is natural for a Master's holder to enter the workforce later than a Bachelor's holder, and for a PhD holder to enter later still. Conversely, someone who secures a job at 20 almost certainly has less than a college degree. Unless we are talking about experienced hires, age will be related to educational level to some extent, which breaks the premise of independence. Even among first-time job seekers, in Korea men usually enter the workforce later than women because of military service, so gender also affects the age of employment.

Definition 1

An effect that arises from a combination of variables and cannot be discerned from any single variable alone is called an interaction.

The Essence of Statistical Fabrication

Assertions that simply compare averages to declare a wage gap between genders or outright claim that college graduates earn more than high school graduates are statistically dubious. While such claims might be true, to substantiate them, rather than looking at nicely decorated infographics and simple indicators, it is essential to review data from multiple perspectives and rely on results obtained through statistically meaningful analysis methods.

Those who use statistics for false propaganda often say "the difference is statistically significant", but in most cases what they actually show is merely a difference in averages. Statistics and averages are not the same thing. If you have studied enough statistics to understand interaction, make sure you are not deceived.

A common way to handle an interaction involving a qualitative variable is to multiply the two variables together. For example, starting from

$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \gamma E + \varepsilon $$

the interaction between age $X_{2}$ and gender $S$ can be taken into account by introducing the new variable $X_{2} \cdot S$, which lets the effect of age differ between males and females. The new regression equation is expressed as

$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \delta ( X_{2} \cdot S ) + \beta_{3} S + \gamma E + \varepsilon $$

For males, it would be

$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \gamma E + \varepsilon $$

and for females,

$$ Y = \beta_{0} + \beta_{1} X_{1} + ( \beta_{2} + \delta ) X_{2} + \beta_{3} + \gamma E + \varepsilon $$
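As a short worked step, subtracting the male equation from the female equation, with $X_{1}$, $X_{2}$, and $E$ held at the same values and the error terms set aside, leaves only the terms that involve gender:

$$ \left[ \beta_{0} + \beta_{1} X_{1} + ( \beta_{2} + \delta ) X_{2} + \beta_{3} + \gamma E \right] - \left[ \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \gamma E \right] = \beta_{3} + \delta X_{2} $$

So the gap between the two groups is not a single number but depends on age: $\beta_{3}$ is the gap at $X_{2} = 0$, and $\delta$ is how much the gap changes per unit of age. If $\delta = 0$, the model reduces to the one without interaction and the gap is the constant $\beta_{3}$.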

Geometrically speaking, both the intercept and the slope of the regression line in age change with gender, which enables a comparison that takes gender into account. In R, such a model can be fit directly by writing the interaction into the model formula. As a tip, applying ^2 to all variables in the formula (for example .^2) automatically includes every main effect together with all pairwise interactions between the variables.
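To see what the ^2 shorthand expands to, the terms of a formula can be listed without fitting anything. The sketch below uses the built-in ToothGrowth data from the Practice section; the three-predictor formula with names `y`, `a`, `b`, `c` is purely hypothetical and only illustrates the expansion.

```r
# Expand the dot against the ToothGrowth columns and list the model terms.
tt2 <- terms(len ~ .^2, data = ToothGrowth)
labels2 <- attr(tt2, "term.labels")
print(labels2)

# With three predictors (hypothetical names a, b, c), ^2 stops at pairwise
# terms: the three-way term a:b:c is not included.
labels3 <- attr(terms(y ~ (a + b + c)^2), "term.labels")
print(labels3)
```

With only two predictors, as in ToothGrowth, the pairwise interaction is the only interaction there is, which is why `.^2` and an explicit `supp * dose` describe the same model there.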

Practice

(Following regression analysis including qualitative variables)

(Image: 20180927\_102536.png)

(Image: 20180920\_094549.png)

out1 is the regression analysis without interaction, and out2 is the one that accounts for it. From the summary, suppVC:dose being significant with a positive coefficient implies that, although the group given vitamin C (VC) generally has shorter tooth length, increasing the dose raises tooth length markedly more in that group. The increase in tooth length per unit increase in dose is $7.811 \cdots$ for orange juice (OJ), while for vitamin C it is $7.811 \cdots + 3.904 \cdots = 11.715 \cdots$. In other words, orange juice is more effective at small doses, while vitamin C catches up as the dose grows because its tooth length increases faster with dose.

(Images: 1.png, 2.png)

Plot 1 is the standardized residual plot for the regression without interaction, and Plot 2 is the one for the regression with interaction. Plot 2 still leaves some doubts, but it is clearly an improvement over Plot 1.

In some analyses, $A$ and $B$ may each turn out not to be significant while $A \ast B$ is significant. Such cases call for caution, but it is crucial to remember that in a properly conducted analysis $A$ and $B$ must stay in the regression model whenever their interaction is included, even if they are not significant on their own. Naturally, the same holds when only one of $A$ or $B$ is not significant: both must still be included.
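The slope arithmetic above can be reproduced from the fitted coefficients. The following is a minimal sketch on the built-in ToothGrowth data; the variable names `fit`, `slope_oj`, `slope_vc`, and `cross_dose` are chosen here for illustration and do not appear in the original code.

```r
# Fit the interaction model on the built-in ToothGrowth data.
fit <- lm(len ~ supp * dose, data = ToothGrowth)
b <- coef(fit)

# Slope of len in dose for each supplement group.
slope_oj <- unname(b["dose"])                     # baseline group: OJ
slope_vc <- unname(b["dose"] + b["suppVC:dose"])  # VC adds the interaction term

# Dose at which the two fitted lines cross:
# the intercept gap divided by the slope gap.
cross_dose <- unname(-b["suppVC"] / b["suppVC:dose"])

print(c(OJ = slope_oj, VC = slope_vc, crossing = cross_dose))
print(range(ToothGrowth$dose))  # compare the crossing point with the observed doses
```

Comparing `cross_dose` with the observed dose range is worth doing before reading too much into which supplement "wins" at high doses, since a crossing point outside the data is an extrapolation of the fitted lines.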

Code

Below is the example code. For this data, the two lm() calls that create out2 fit the same model.

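    # Inspect the data: len (tooth length), supp (OJ or VC), dose
    head(ToothGrowth); tail(ToothGrowth); str(ToothGrowth)

    # Two equivalent formulas for the model with the supp:dose interaction
    out2 <- lm(len ~ .^2, data = ToothGrowth); summary(out2)
    out2 <- lm(len ~ . + supp * dose, data = ToothGrowth); summary(out2)

    # Open a 5 x 5 inch device and plot the studentized residuals of out2
    dev.new(width = 5, height = 5)
    plot(rstudent(out2), main = "Standardized Residual Plot 2"); abline(h = 0)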
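The comparison between out1 and out2 described in the Practice section can also be checked with a partial F-test. This is a minimal sketch, assuming out1 was fit with main effects only as `lm(len ~ ., data = ToothGrowth)`; that call is not shown in this post, so its exact form is an assumption.

```r
# Assumed form of out1 (main effects only); the original call is not shown here.
out1 <- lm(len ~ ., data = ToothGrowth)
out2 <- lm(len ~ .^2, data = ToothGrowth)

# Partial F-test: does adding supp:dose improve on the main-effects model?
cmp <- anova(out1, out2)
print(cmp)

# p-value of the added interaction term (second row of the table).
p_interaction <- cmp[["Pr(>F)"]][2]
print(p_interaction)
```

Because the two models differ by a single term, this F-test agrees with the t-test on suppVC:dose reported by `summary(out2)`.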

  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p125-127.