
Variable Selection Criteria in Statistical Analysis

Overview

The problem of variable selection inevitably involves the subjectivity of the analyst, but a numerical indicator that helps draw as objective a conclusion as possible was needed. If such values could be calculated, it would provide a clear answer to when to stop the variable selection procedure. However, there are various types of criteria, and applying different criteria can lead to different results.

Indicators [^1]

Explained Variance $R^{2}$

The explained variance is calculated by $\displaystyle R^{2} = 1 - {{ \text{SSE} } \over { \text{SST} }}$, and the closer it is to $1$, the better the model explains the data.

However, as a criterion for variable selection, it is not appropriate if the number of independent variables differs among the compared models.
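As a minimal sketch of the computation (assuming a simple least-squares fit with NumPy; the data and variable names are purely illustrative), $R^{2}$ follows directly from SSE and SST:

```python
import numpy as np

# Illustrative data: one predictor, simple linear regression
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit y = b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

sse = np.sum((y - y_hat) ** 2)        # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
r_squared = 1 - sse / sst
print(r_squared)
```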

Adjusted Explained Variance $R_{a}^{2}$

In regression analysis, adding variables gives the model more information to work with, so the explained variance $R^{2}$ never decreases. In contrast, the adjusted explained variance is calculated by $\displaystyle R^{2}_{a} = 1 - {{ \text{SSE} / (n - p - 1) } \over { \text{SST} / (n - 1) }}$, which reflects the number of variables.

The adjusted explained variance $R^{2}_{a}$ applies a penalty for the number of variables, overcoming the drawback of the explained variance, which becomes meaningless when the number of variables differs. Although it is not a popular criterion for variable selection, it is an indispensable indicator: unlike other criteria, which are relative and only meaningful when comparing models, the adjusted explained variance also indicates on its own how well a model explains the data. It can be useful when one is looking for the model with the highest explanatory power rather than the optimal model.
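A minimal sketch of the penalty, assuming SSE, SST, $n$, and $p$ are already known; the helper function `adjusted_r_squared` is hypothetical, not from any particular library:

```python
def adjusted_r_squared(sse: float, sst: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p independent variables and n observations."""
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

# Example: the same SSE and SST, but more variables -> a larger penalty
print(adjusted_r_squared(sse=4.0, sst=100.0, n=30, p=2))
print(adjusted_r_squared(sse=4.0, sst=100.0, n=30, p=10))
```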

Akaike Information Criterion $\text{AIC}_{p}$

For $p$ independent variables, the Akaike Information Criterion is calculated by $\displaystyle \text{AIC}_{p} := n \ln \left( {{ \text{SSE}_{p} } \over { n }} \right) + 2(p+1)$. AIC is the most favored metric in actual analysis, and the model with the lower AIC is judged to be better.

The second term in the formula applies a penalty as $p$ increases, that is, as more variables are used. A downside of AIC is that comparisons become inaccurate when the sample size $n$ differs. It might seem odd that $n$ could differ when analyzing the same data and changing only the variables, but this can be a critical issue if certain variables have many missing values.
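A minimal sketch of the formula above, assuming $\text{SSE}_{p}$ has already been computed; the helper `aic` is hypothetical:

```python
import numpy as np

def aic(sse_p: float, n: int, p: int) -> float:
    """AIC_p = n * ln(SSE_p / n) + 2 (p + 1), following the formula above."""
    return n * np.log(sse_p / n) + 2 * (p + 1)

# A slightly smaller SSE does not always justify an extra variable
print(aic(sse_p=4.0, n=30, p=2))   # smaller model
print(aic(sse_p=3.9, n=30, p=3))   # one more variable, marginal SSE gain
```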

Bayes Information Criterion $\text{BIC}_{p}$

For $p$ independent variables, the Bayes Information Criterion is calculated by $\displaystyle \text{BIC}_{p} := n \ln \left( {{ \text{SSE}_{p} } \over { n }} \right) + (p+1) \ln n$. It is similar to AIC with a modified last term and complements it; likewise, the lower value is considered better.
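A minimal sketch under the same assumptions as the AIC example (a known $\text{SSE}_{p}$; the helper `bic` is hypothetical):

```python
import numpy as np

def bic(sse_p: float, n: int, p: int) -> float:
    """BIC_p = n * ln(SSE_p / n) + (p + 1) * ln(n), following the formula above."""
    return n * np.log(sse_p / n) + (p + 1) * np.log(n)

# BIC penalizes extra variables more heavily than AIC once ln(n) > 2 (n >= 8)
print(bic(sse_p=4.0, n=30, p=2))
print(bic(sse_p=3.9, n=30, p=3))
```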

Mallows $C_{p}$

For $p$ independent variables, Mallows $C_{p}$ is calculated by $\displaystyle C_{p} := {{ \text{SSE}_{p} } \over { \hat{\sigma}^{2} }} + (2p - n)$, where $\hat{\sigma}^{2}$ is typically the error variance estimate from the full model containing all candidate variables.

$C_{p}$ selects variables in the direction of less bias, and a model is considered better the closer it gets to $C_{p} \approx p$, which indicates small bias. It is useful and mathematically neat when an analysis must be cautious about bias, but techniques that accept some bias in exchange for a large reduction in variance to fine-tune the fit have recently gained popularity, so $C_{p}$ is used less often.
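A minimal sketch of the computation, assuming $\text{SSE}_{p}$ and a full-model variance estimate are given; the helper `mallows_cp` and the numbers are hypothetical:

```python
def mallows_cp(sse_p: float, sigma2_full: float, n: int, p: int) -> float:
    """C_p = SSE_p / sigma^2 + (2p - n); sigma^2 is estimated from the full model."""
    return sse_p / sigma2_full + (2 * p - n)

# A submodel whose C_p lands close to p is preferred (small bias)
print(mallows_cp(sse_p=28.0, sigma2_full=1.0, n=30, p=2))  # C_p = 2, close to p
```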