
Variance Inflation Factor VIF

Definition 1

When performing multiple regression analysis $Y \gets X_{1} , \cdots, X_{p}$, let $R_{i}^{2}$ denote the coefficient of determination obtained by regressing the $i$th independent variable $X_{i}$ on the remaining independent variables. The following is called the Variance Inflation Factor for $X_{i}$. $$\displaystyle \text{VIF}_{i}: = {{1} \over {1 - R_{i}^{2} }}$$
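
For example, if the remaining independent variables explain $X_{i}$ with $R_{i}^{2} = 0.9$, then $$\text{VIF}_{i} = {{1} \over {1 - 0.9}} = 10$$ so the closer $R_{i}^{2}$ gets to $1$, the more the variance inflation factor blows up.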

Explanation

First, it is recommended to read about multicollinearity.

VIF is sometimes translated as the "variance expansion index," but the full term is unwieldy, so it is usually just called VIF.

It is widely used as a numerical indicator for detecting multicollinearity that is hard to spot with the analyst's intuition alone. Since the explanatory power of a regression always satisfies $0 \le R_{i}^{2} < 1$, we always have $\text{VIF}_{i} \ge 1$. A high $R_{i}^{2}$ means that the remaining independent variables, excluding $X_{i}$ itself, explain $X_{i}$ well. Therefore, the stronger the multicollinearity, the larger $\displaystyle \text{VIF}_{i} = {{1} \over {1 - R_{i}^{2} }}$ becomes.

If $\text{VIF}_{i}$ for the $i$th independent variable exceeds $3$, $5$, or $10$, depending on the rule of thumb used, the variable is judged likely to be involved in multicollinearity. For those with a mathematical background, such criteria can feel uncomfortable: they are empirical, so there is no clear sense in which one cut-off is right and another wrong. Moreover, even when the data show fairly undeniable multicollinearity, there are often cases where nothing can be done about it. On the other hand, this also means that if the analyst understands the data and can justify the choice convincingly, a somewhat lenient criterion may be applied.
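
Rewriting these rules of thumb through the definition shows how much of $X_{i}$ the other variables must explain to trip each cut-off: $$\text{VIF}_{i} > 3 \iff R_{i}^{2} > \tfrac{2}{3}, \qquad \text{VIF}_{i} > 5 \iff R_{i}^{2} > 0.8, \qquad \text{VIF}_{i} > 10 \iff R_{i}^{2} > 0.9$$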

Practice

(Following the method for detecting multicollinearity.) In the previous post, we ran a multiple regression with collegeGrad as the dependent variable and suspected that white suffers from strong multicollinearity. As expected, the VIF of white turns out to be large enough to exceed $10$.

However, calculating the VIF in this way every time is inconvenient, so let’s use the vif() function of the car package to calculate it.
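
A minimal sketch of the call (the full script, using the same MplsDemo data, is given in the Code section below):

library(car)
fit<-lm(collegeGrad~.-neighborhood,data=MplsDemo)   # regress collegeGrad on everything except the neighborhood labels
vif(fit)                                            # one VIF per remaining predictor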

We were already suspecting multicollinearity because the regression results differed greatly from intuition, and the fact that the VIF of white exceeds $10$ makes it almost certain that the white variable is problematic. The VIF of black is not comfortable either, but let's remove white first.

Re-fitting the model after removing white not only gave quite good results, but the multicollinearity involving black also eased once white was gone. If the model diagnostics reveal no particular problems, this regression model can be considered quite useful, as it uses only $3$ variables yet exceeds $70 \%$ in explanatory power.

Code

Below is the example code.

install.packages('car')
library(car)                                        # provides vif() and the MplsDemo data
 
DATA=MplsDemo; head(DATA)
 
win.graph()
plot(DATA[,-1])                                     # scatterplot matrix of all variables except the neighborhood labels
 
# Manual VIF: regress each predictor on the others (excluding the response collegeGrad) and compute 1/(1-R^2)
out<-lm(population~.-collegeGrad-neighborhood,data=DATA)
print(paste("VIF of population : ",(1-summary(out)$r.squared)^{-1}))
 
out<-lm(white~.-collegeGrad-neighborhood,data=DATA)
print(paste("VIF of white : ",(1-summary(out)$r.squared)^{-1}))
 
out<-lm(black~.-collegeGrad-neighborhood,data=DATA)
print(paste("VIF of black : ",(1-summary(out)$r.squared)^{-1}))
 
out<-lm(foreignBorn~.-collegeGrad-neighborhood,data=DATA)
print(paste("VIF of foreignBorn : ",(1-summary(out)$r.squared)^{-1}))
 
out<-lm(hhIncome~.-collegeGrad-neighborhood,data=DATA)
print(paste("VIF of hhIncome : ",(1-summary(out)$r.squared)^{-1}))
 
out<-lm(poverty~.-collegeGrad-neighborhood,data=DATA)
print(paste("VIF of poverty : ",(1-summary(out)$r.squared)^{-1}))
 
# Same VIFs via car::vif(), before and after dropping white
out0<-lm(collegeGrad~.-neighborhood,data=DATA)
summary(out0); vif(out0)
 
out1<-lm(collegeGrad~.-neighborhood-white,data=DATA)
summary(out1); vif(out1)

  1. Hadi. (2006). Regression Analysis by Example (4th Edition): p236. ↩︎