Influence of Interaction in Regression Analysis
Buildup
It is recommended to read about regression analysis including qualitative variables first.
Imagine predicting the starting salaries $Y$ of this year's graduates from their college entrance exam score $X_{1}$, age $X_{2}$, gender $S$, and final educational attainment $E$. Gender and educational attainment are qualitative variables, so they are encoded as dummy variables. Gender is defined as
$$ S = \begin{cases} 1 & ,\text{female} \\ 0 & ,\text{male} \end{cases} $$
and educational attainment as
$$ E_{1} = \begin{cases} 1 & ,\text{college graduate} \\ 0 & ,\text{high school graduate} \end{cases} \\ E_{2} = \begin{cases} 1 & ,\text{master's degree} \\ 0 & ,\text{high school graduate} \end{cases} \\ E_{3} = \begin{cases} 1 & ,\text{PhD} \\ 0 & ,\text{high school graduate} \end{cases} $$
The regression equation can then be written as
$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \gamma_{1} E_{1} + \gamma_{2} E_{2} + \gamma_{3} E_{3} + \varepsilon $$
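The dummy coding above can be reproduced in R, where model.matrix() builds the indicator columns from a factor automatically. This is only a sketch: the data frame `grads`, its columns `sex` and `edu`, and the level names are made up for illustration and are not part of the original example.

```r
# Hypothetical toy data: names and values are illustrative only
grads <- data.frame(
  sex = factor(c("male", "female", "female", "male"),
               levels = c("male", "female")),
  edu = factor(c("highschool", "bachelor", "master", "phd"),
               levels = c("highschool", "bachelor", "master", "phd"))
)

# The first level of each factor is the baseline: all of its dummies are 0
X <- model.matrix(~ sex + edu, data = grads)
print(X)
```

Each non-baseline level gets its own 0/1 column, which matches the roles of $S$, $E_{1}$, $E_{2}$, and $E_{3}$ in the equation.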
One might conjecture that a higher college entrance exam score reflects intelligence or diligence, that older graduates have more experience, that men more often take on high-intensity physical labor, and that higher educational attainment leads to jobs requiring specialization, all of which would contribute to higher starting salaries.
However, such assumptions rest on the premise that $X_{1} , X_{2} , S , E$ are independent of each other. One visible problem is the correlation between age and final educational attainment. Except for a few prodigies, a Master's holder naturally enters the workforce later than a Bachelor's holder, and a PhD holder later still. Conversely, someone who secures a job at 20 almost certainly has not graduated from college. Unless the data include experienced hires, age will be related to educational level to some degree, which breaks the premise of independence. Even among new entrants to the workforce, men in Korea usually start work later than women because of military service, so gender also affects the age of employment.
Definition 1
An effect that arises from the combination of variables and cannot be identified from the individual variables alone is called interaction.
The Essence of Statistical Fabrication
Assertions that simply compare averages to declare a wage gap between genders or outright claim that college graduates earn more than high school graduates are statistically dubious. While such claims might be true, to substantiate them, rather than looking at nicely decorated infographics and simple indicators, it is essential to review data from multiple perspectives and rely on results obtained through statistically meaningful analysis methods.
Those who use statistics for false propaganda often say 'there is a statistically significant difference', but in reality most of these claims are nothing more than differences in averages. Statistics and averages are not the same thing. Anyone educated enough in statistics to understand interaction should make sure not to be deceived.
To handle an interaction involving a qualitative variable, a commonly used method is to multiply the two variables. For example, in the model
$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} S + \gamma E + \varepsilon $$
to consider the interaction between age $X_{2}$ and gender $S$, one adds the product $X_{2} \cdot S$ as a new variable, which lets the effect of age differ between males and females. The new regression equation is expressed as
$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \delta ( X_{2} \cdot S ) + \beta_{3} S + \gamma E + \varepsilon $$
For males ($S = 0$), it would be
$$ Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \gamma E + \varepsilon $$
and for females ($S = 1$),
$$ Y = \beta_{0} + \beta_{1} X_{1} + ( \beta_{2} + \delta ) X_{2} + \beta_{3} + \gamma E + \varepsilon $$
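As a worked step, subtract the male equation from the female one while holding $X_{1}$, $X_{2}$, and $E$ fixed. This assumes, as is standard but not stated above, that the error term has mean zero; $\mathbb{E}$ below denotes expectation and is not the education variable $E$.

$$ \mathbb{E} [ Y \mid S = 1 ] - \mathbb{E} [ Y \mid S = 0 ] = \beta_{3} + \delta X_{2} $$

The expected gap between the two groups is therefore not a constant: $\beta_{3}$ is the gap at $X_{2} = 0$, and the gap changes by $\delta$ for every additional year of age. Setting $\delta = 0$ recovers the model without interaction, in which the two lines are parallel.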
Geometrically speaking, both the intercept and the slope of the regression line with respect to age differ by gender, which makes a comparison that takes gender into account possible. In R, a regression model set up this way can be fitted directly. As a tip, applying the square (^2) to all variables in the model formula automatically includes every pairwise interaction between the variables.
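To see what the ^2 shorthand expands to, one can inspect the terms of a formula without fitting anything. This sketch assumes the built-in ToothGrowth data (the data used in the Practice section), where the dot stands for every column other than the response `len`; the names `f`, `term_labels`, and `same_labels` are introduced here for illustration.

```r
# Expand `.^2` against ToothGrowth's predictor columns (supp, dose)
f <- terms(len ~ .^2, data = ToothGrowth)
term_labels <- attr(f, "term.labels")
print(term_labels)  # main effects plus their pairwise interaction

# supp * dose is shorthand for supp + dose + supp:dose
same_labels <- attr(terms(len ~ supp * dose), "term.labels")
print(identical(term_labels, same_labels))
```

With only two predictors there is a single interaction term; with more predictors the same formula would add one term per pair, so the model can grow quickly.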
Practice
(Following regression analysis including qualitative variables)
out1 is a regression analysis without interaction, and out2 is one that accounts for interaction.
From the summary, suppVC:dose being significant with a positive coefficient implies that, although the vitamin C (VC) group generally has shorter tooth length, a higher dose increases tooth length more steeply in that group. Per unit increase in dose, tooth length increases by $7.811 \cdots$ in the orange juice (OJ) group, whereas in the vitamin C group it increases by $7.811 \cdots + 3.904 \cdots = 11.715 \cdots$. At small doses orange juice is more beneficial, but as the dose grows vitamin C increasingly helps tooth length.
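The slope arithmetic can be checked directly from the fitted coefficients. The sketch below refits the interaction model so that it runs on its own; the names `fit`, `b`, `slope_oj`, `slope_vc`, and `crossover` are introduced here for illustration.

```r
fit <- lm(len ~ supp * dose, data = ToothGrowth)
b <- coef(fit)

slope_oj <- unname(b["dose"])                     # baseline group (OJ)
slope_vc <- unname(b["dose"] + b["suppVC:dose"])  # VC group
print(c(OJ = slope_oj, VC = slope_vc))

# Dose at which the two fitted lines cross:
# suppVC + suppVC:dose * dose = 0
crossover <- unname(-b["suppVC"] / b["suppVC:dose"])
print(crossover)
print(range(ToothGrowth$dose))
```

If the crossover lies outside the observed range of doses, any statement about which supplement wins at larger doses is an extrapolation of the fitted lines, not something seen in the data.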
Plot 1 is the standardized residual plot for the regression without interaction, and Plot 2 is the standardized residual plot for the regression with interaction. Some doubts remain about Plot 2, but it is evidently an improvement over Plot 1.
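Residual plots are a visual check, and a nested-model F-test is one common numerical complement. The sketch below assumes that out1 from the preceding post is the main-effects model `len ~ .`; that model is not shown in this post, so it is refit here under that assumption.

```r
# Assumption: out1 is the main-effects-only model from the preceding post
out1 <- lm(len ~ ., data = ToothGrowth)
out2 <- lm(len ~ .^2, data = ToothGrowth)

# F-test of the nested models: does adding supp:dose reduce residual variance?
cmp <- anova(out1, out2)
print(cmp)

# Standardized residuals of the two fits, side by side
op <- par(mfrow = c(1, 2))
plot(rstandard(out1), main = "Without interaction"); abline(h = 0)
plot(rstandard(out2), main = "With interaction"); abline(h = 0)
par(op)
```

Because the two models differ by a single term, the p-value of this F-test coincides with the p-value of the suppVC:dose row in summary(out2).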
In some analyses, $A$ and $B$ may each turn out not to be significant while $A \ast B$ is significant. Such cases call for caution, but the crucial point is that, in a properly conducted analysis, $A$ and $B$ should stay in the regression model alongside $A \ast B$ even if they are not significant. Naturally, the same holds when only one of $A$ or $B$ is not significant: both must still be included.
Code
Below is example code. In this post, the two lm() calls that create out2 fit the same model.
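# Inspect the data: len (tooth length), supp (OJ or VC), dose
head(ToothGrowth); tail(ToothGrowth); str(ToothGrowth)
# Two equivalent ways to fit the model with the supp:dose interaction
out2<-lm(len~.^2,data=ToothGrowth); summary(out2)
out2<-lm(len~.+supp*dose,data=ToothGrowth); summary(out2)
# dev.new() replaces the Windows-only win.graph()
dev.new(width=5,height=5); plot(rstudent(out2), main="Standardized Residual Plot 2"); abline(h=0)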
Hadi. (2006). Regression Analysis by Example (4th Edition): p125-127.