
Multicollinearity

Definition 1

Consider performing multiple regression analysis $Y \gets X_{1} , \cdots , X_{p}$. If there is strong correlation among the independent variables $X_{1} , \cdots , X_{p}$, the model is said to have multicollinearity.
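For a toy illustration with simulated data (not from the source), suppose $X_{3}$ is an exact linear combination of $X_{1}$ and $X_{2}$. This is the extreme case of multicollinearity, and R cannot even estimate all the coefficients:

set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)
x3 <- x1 + x2                  # perfect multicollinearity: x3 = x1 + x2
y  <- x1 - x2 + rnorm(50)
coef(lm(y ~ x1 + x2 + x3))     # the coefficient of x3 is reported as NA (aliased)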

Practice

To begin with, the very idea that the independent variables depend on one another violates the assumptions of regression analysis, and in practice it causes numerical problems that make the results unreliable. Even determining whether multicollinearity is present in the data can be a task in itself.

Data Exploration

Let’s load the MplsDemo data included in the car package.

MplsDemo records, for each neighborhood of Minneapolis, USA, estimates of population (total population), white (proportion of white residents), black (proportion of black residents), foreignBorn (proportion of foreign-born residents), hhIncome (household income), poverty (proportion in poverty), and collegeGrad (proportion of college graduates).
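The data can be loaded and inspected as follows; the full script is collected in the Code section at the end:

library(car)     # MplsDemo ships with the car package
head(MplsDemo)   # first few neighborhoods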

(Regression output and residual plot for the full model.)

After fitting the regression, the hypothesis tests appear unproblematic, the explanatory power is satisfactory, and the residuals look fine. If you examine the regression coefficients closely, however, having more white residents appears to increase the proportion of college graduates, and having more foreign-born residents does as well. Although this data does not carry precise information for every race, this seems odd. It is also disconcerting that higher income increases the graduate rate while poverty has no impact at all.

(Scatter plot matrix of the variables.)

Looking at the scatter plot, it seems natural that white and hhIncome have a positive correlation with collegeGrad while black and foreignBorn have a negative one. poverty is ambiguous but, if anything, shows a weak negative correlation. The positive regression coefficients for black and foreignBorn therefore suggest that this data has not been properly explained. The explanatory power itself is above 0.8, which is decent, but something is definitely unsettling. From the standpoint of multicollinearity, it is troubling that white appears to be strongly correlated with the other variables.
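To quantify what the scatter plot shows, one can also print the correlation matrix; a minimal sketch, assuming the data is loaded as above and that the first column is the neighborhood name:

round(cor(MplsDemo[, -1]), 2)   # pairwise correlations, neighborhood column dropped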

Model Modification

Let’s remove the independent variable white, the proportion of white residents, and run the regression analysis again.

(Regression output for the model without white.)

The new results explain the anticipated regression relationships in a relatively sensible manner, even though the explanatory power has fallen by nearly 10%. The coefficient of foreignBorn is still a bit odd, but since it is not significant, there seems to be no need to worry about it. From here, how to handle the data for a more refined analysis is entirely up to the analyst.

Detection of Multicollinearity

There are situations that strongly suggest the presence of multicollinearity, such as the following:

  1. The F-test passes, but the individual regression coefficients fail their t-tests (see the sketch after this list)
  2. The signs of the regression coefficients are opposite to what was expected to the point of being contradictory
  3. When data is added or removed, the existing regression coefficients change drastically
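Symptom 1 is easy to reproduce with simulated data (a sketch, not from the source): when two predictors are nearly identical, the overall F-test is highly significant while neither individual t-test is.

set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)   # x2 is almost a copy of x1
y  <- x1 + x2 + rnorm(100)
summary(lm(y ~ x1 + x2))           # large overall F, but weak individual t-statistics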

In case 1, it is a relief in the sense that the multicollinearity has at least been detected. However the data ends up being handled to solve the problem, the fact that multicollinearity was identified is a good thing.

Case 2 is easy to detect when it runs contrary to intuition, but depending on the ‘expected’ results and the data, it can be very difficult to identify multicollinearity this way. If you are investigating factors that affect plant growth, you can roughly predict whether independent variables like sunlight, amount of water, and soil quality should help or not. But for poorly understood problems in the social sciences, the analyst’s intuition can be unreliable. Even when multicollinearity is present, the analysis may look correct, and without proper review it can yield results that completely fail to explain the actual phenomenon.

In case 3, when there are many independent variables it is hard to keep track of each regression coefficient individually, and the issue can be overlooked even while looking straight at it. ‘Data’ here refers not only to independent variables but possibly to outliers as well.

In any of these cases the analysis may be flawed, and the explanatory power $R^{2}$ can be a cue for suspecting multicollinearity. However, there is no clear standard for how high the explanatory power should be, an incorrect analysis can still show high explanatory power, and judgments about it are quite subjective, so it is unreliable on its own. In reality, it is rare for acquired data to have completely independent variables. Even when the relationship is too weak to call multicollinearity outright, some relationship usually exists and muddies the effects.

As the example shows, detection through scatter plots is not foolproof either. A scatter plot quickly reveals the relationship between two variables, but a relationship involving several variables at once, such as $X_{1} + X_{2} + X_{3} = 1$, can be difficult to spot by eye.
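A simulated sketch of exactly this situation (hypothetical data, not from MplsDemo): three variables that satisfy the sum-to-one constraint almost exactly show pairwise correlations of only about $-0.7$, nowhere near $\pm 1$, yet the variance inflation factor (VIF, introduced in the next section) is enormous.

set.seed(42)
x1 <- runif(100); x2 <- runif(100)
x3 <- 1 - x1 - x2 + rnorm(100, sd = 0.01)   # X1 + X2 + X3 is approximately 1
round(cor(cbind(x1, x2, x3)), 2)            # no pairwise correlation is near -1 or 1
r2 <- summary(lm(x3 ~ x1 + x2))$r.squared
1 / (1 - r2)                                # VIF of x3: over a thousand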

Numerical Indicators

Naturally, one must turn to numerical indicators. The most popular is the Variance Inflation Factor (VIF), which is useful for detecting multicollinearity. However, since the VIF does not follow a probability distribution, hypothesis testing is not possible; conclusions rest on empirically chosen thresholds, which is itself ambiguous and potentially problematic. Regression analysis is, to some extent, a battle with multicollinearity.
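The VIF of the $j$-th variable is $\displaystyle \text{VIF}_{j} = \frac{1}{1 - R_{j}^{2}}$, where $R_{j}^{2}$ is the coefficient of determination from regressing $X_{j}$ on the remaining independent variables; values above 10 are a commonly cited (if arbitrary) warning threshold. As a sketch, the car package computes it directly for the full model of the example:

out0 <- lm(collegeGrad ~ . - neighborhood, data = MplsDemo)
vif(out0)   # one VIF per independent variable; requires library(car)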

Besides the VIF, there are other indicators, such as the condition number obtained through principal component analysis, but these are used less frequently.
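For completeness, a minimal sketch using base R’s kappa(), which returns the ratio of the largest to the smallest singular value of a matrix; by an empirical convention (not a test), values above roughly 30 for the standardized design matrix are taken as a warning sign. This reuses out0 from the previous snippet:

X <- model.matrix(out0)[, -1]   # design matrix without the intercept column
kappa(scale(X), exact = TRUE)   # exact condition number of the standardized columns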

Code

The following is the example code used above.

install.packages('car')
library(car)

DATA=MplsDemo; head(DATA)                               # load and inspect the data

win.graph()
plot(DATA[,-1])                                         # scatter plot matrix (neighborhood dropped)

out0<-lm(collegeGrad~.-neighborhood,data=DATA)          # full model
summary(out0)

win.graph(4,4)
plot(out0$residuals, main="Residuals")                  # residual plot

out1<-lm(collegeGrad~.-neighborhood-white,data=DATA)    # model without white
summary(out1)

  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p222. ↩︎