Variable Selection Procedures in Statistical Analysis

Buildup

Let’s consider a multiple regression analysis $Y \gets X_{1} , \cdots, X_{p}$. With $p$ independent variables, the ideal situation is one in which the various assumptions of regression analysis are well satisfied, there is no multicollinearity, and the explanatory power is high.

Of course, the more information the better, but a regression model obtained from a large amount of data also requires a large amount of data whenever it is used. It is therefore better to simplify the model by reducing the number of independent variables where possible. Considering the ‘cost’ of understanding and using the model, ‘optimizing’ matters more than ‘maximizing.’ The work required for this is removing unnecessary independent variables.

For example, if an analysis using $p=7$ variables gives an explanatory power of $R_{p}^{2} = 89\%$, that is in itself a good result. However, if using only $q=2$ independent variables already yields an explanatory power of about $R_{q}^{2} = 87\%$, it may be much better to use just those two.

In this case, the model obtained with all $p$ independent variables is called the Full Model FM, and a model obtained by reducing the independent variables to $q$ of them is called a Reduced Model RM. The model with no independent variables is called the Null Model NM, and the problem is to find the most appropriate Reduced Model among those lying between the Null Model and the Full Model. Fortunately, understanding this process requires no mathematical knowledge. Simply put, it is just a matter of ‘examining all possible cases.’

If the number of independent variables is $p$, then the number of candidate models is $2^{p}$. With today’s improved computers it is actually feasible to examine every case unless the data is what is known as big data, but in the past the following methods were devised to reduce the amount of computation, even a little.
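
For a sense of scale, the sketch below counts the $2^{p}$ candidate models and runs an exhaustive all-subsets search. It is only an illustration: the leaps package and the built-in mtcars data are assumptions made here, not part of the analysis carried out later in this post.

library(leaps)        # assumed installed; provides regsubsets() for all-subsets search

p <- 7
2^p                   # 128 candidate models when there are 7 independent variables

# exhaustive search over all subsets of the mtcars predictors (illustrative data)
fits <- regsubsets(mpg ~ ., data = mtcars, nvmax = p)
summary(fits)$adjr2   # adjusted R^2 of the best model of each size 1..p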

Variable Selection Methods 1

Forward Selection Procedure

Starting from the Null Model $Y \gets 1$, variables are added one at a time, each time adding the candidate with the largest t-test statistic. When even the largest t-statistic among the remaining candidates falls below $1$, the addition of variables stops. The procedure is simple and quite effective.
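
As a minimal sketch of how this looks in R, step() can run forward selection when it is given the Null Model as a starting point and a formula of candidate variables as the scope. Note that step() ranks candidates by AIC rather than by the t-statistic cutoff described above, and mtcars is used here purely as stand-in data.

# forward selection: start from the Null Model and add variables one at a time
null_fit <- lm(mpg ~ 1, data = mtcars)                # Null Model  Y <- 1
fwd <- step(null_fit,
            scope = ~ wt + hp + disp + drat + qsec,   # candidate independent variables
            direction = "forward")
summary(fwd)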

Backward Elimination Procedure

Starting from the Full Model $Y \gets X_{1} , \cdots, X_{p}$, variables are eliminated one at a time, each time removing the variable with the smallest t-test statistic. When the smallest remaining t-statistic is no longer less than $1$, the elimination of variables stops. Like forward selection, it is simple and quite effective.
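
A corresponding sketch for backward elimination with step(): start from the Full Model and drop one variable at a time. Again, step() uses AIC as its criterion rather than a t-statistic cutoff, and mtcars is only illustrative data.

# backward elimination: start from the Full Model and remove variables one at a time
full_fit <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)   # Full Model
bwd <- step(full_fit, direction = "backward")
summary(bwd)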

Stepwise Method

Starting from a first Reduced Model RM1 proposed by the analyst, existing variables may be removed and previously removed variables may be added back in, and the best of the resulting models is selected as the next model. Because it combines the strengths of Forward Selection and Backward Elimination, it requires more careful computation, and its speed can be seen as a drawback. However, when the computational burden and speed are not an issue, it is widely used as the best of the three methods.
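
A minimal sketch of the stepwise method with step(): start from an analyst-proposed reduced model and let variables enter or leave within a given scope. The starting formula, the scope, and the mtcars data are all assumptions made for illustration.

# stepwise method: start from a first Reduced Model RM1 chosen by the analyst
rm1  <- lm(mpg ~ wt + hp, data = mtcars)
both <- step(rm1,
             scope = list(lower = ~ 1, upper = ~ wt + hp + disp + drat + qsec),
             direction = "both")             # variables may be added or removed at each step
summary(both)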

Manual

The variable selection procedures introduced above are quite good methods as long as there is no multicollinearity. Most regression analyses end with this process, and unless the analysis is unusually strange or difficult, it does not deviate much from the manual below.

  • Step 1. Data Handling
    Directly examining the data for missing values, outliers, and parts that require transformation.
  • Step 2. Understanding the Relationships Between Variables
    Inspecting the model diagnostics and VIF of the Full Model obtained using all variables, and identifying interactions or multicollinearity.
  • Step 3. Determining the Full Model
    Conducting model diagnostics on the regression model with the multicollinearity issue resolved, and deciding on a new Full Model.
  • Step 4. Selecting Variables
    Applying the Stepwise Method and selecting the best Reduced Model according to the variable selection criteria.
  • Step 5. Determining the Optimal Model
    If the selected Reduced Model passes the model diagnostics without issues, it is determined to be the Optimal Model.

Practice

(Continuing from the method for detecting multicollinearity in the previous post)

(Screenshots: VIF values and residual plot for the model with the white variable removed)

From the previous post, removing the white variable seemed to lead to a proper regression analysis. Indeed, calculating the VIF shows that multicollinearity is gone, and there do not appear to be significant issues with the residual plots. This completes Step 3 in the above manual.

Now, to eliminate unnecessary variables, let’s use the stepwise selection method. Fortunately, R provides the function step() for exactly this purpose. The model passed to it corresponds to the ‘new Full Model’ of Step 3, which is out1 in this example. The direction option determines the type of selection method: “forward” gives Forward Selection, “backward” gives Backward Elimination, and “both” gives the Stepwise Method.

(Screenshot: step() output showing the AIC computed as variables are added and removed)

Executing the function shows how the variable selection criterion AIC was calculated when adding and removing certain variables. Using AIC as the criterion for the stepwise selection method, it can be seen that only three variables, black, hhIncome, and poverty, are necessary. This concludes Step 4, and finally, Step 5 is completed by determining the Optimal Model.
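
If one wants to check the criterion directly, extractAIC() reports the (equivalent degrees of freedom, AIC) pair that step() uses internally. The sketch below assumes the out1 and out2 objects defined in the Code section at the end of this post.

extractAIC(out1)   # AIC of the new Full Model
extractAIC(out2)   # AIC of the selected Reduced Model; the smaller value is preferred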

(Screenshots: summary and residual plot of the final model out2)

Code

Below is the example code.

library(car)   # provides vif(); the MplsDemo data comes with it (via carData)

DATA <- MplsDemo; head(DATA)

# new Full Model: every predictor except neighborhood and white (Step 3)
out1 <- lm(collegeGrad~.-neighborhood-white, data=DATA)
summary(out1); vif(out1)

# residuals of the Full Model, standardized with scale() (win.graph() is Windows-only)
win.graph(4,4); plot(scale(out1$residuals), main="Residuals")

# stepwise selection using AIC (Step 4)
out2 <- step(out1, direction="both")
summary(out2)

# residuals of the selected model, standardized with scale() (Step 5)
win.graph(4,4); plot(scale(out2$residuals), main="Residuals")

  1. Hadi. (2006). Regression Analysis by Example (4th Edition): pp. 289–291. ↩︎