Variable Selection Criteria in Statistical Analysis
Overview
The problem of variable selection inevitably involves the analyst's subjectivity, so numerical indicators are needed to help draw conclusions as objectively as possible. If such values can be calculated, they give a clear answer to when the variable selection procedure should stop. However, there are various kinds of criteria, and applying different criteria can lead to different results.
Indicators [^1]
Explained Variance $R^2$
The explained variance (also called the coefficient of determination) is calculated by $\displaystyle R^2 = 1 - {{ \text{SSE} } \over { \text{SST} }}$ and is interpreted as follows: the closer it is to $1$, the better the model explains the data.
However, as a criterion for variable selection, it is not appropriate if the number of independent variables differs among the compared models.
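As a quick check of the formula, here is a minimal Python sketch; the toy data and the `r_squared` helper are hypothetical, written only to mirror the definition above.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE / SST, computed directly from the definition."""
    sse = np.sum((y - y_hat) ** 2)       # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1 - sse / sst

# Toy example: fit a straight line by least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, 1)
print(r_squared(y, slope * x + intercept))  # near 1: the line explains the data well
```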
Adjusted Explained Variance $R_{a}^2$
In regression analysis, adding variables gives the model more flexibility to fit the data, so the explained variance $R^{2}$ never decreases. In contrast, the adjusted explained variance is calculated by $\displaystyle R^{2}_{a} = 1 - {{ \text{SSE} / (n - p - 1) } \over { \text{SST} / (n - 1) }}$, which reflects the number of variables $p$.
The adjusted explained variance $R^{2}_{a}$ applies a penalty for the number of variables, overcoming the drawback of the explained variance, which becomes meaningless when the numbers of variables differ. Although it is not a popular criterion for variable selection, it is an indispensable indicator: unlike the other criteria below, which are relative and only meaningful when comparing models, the adjusted explained variance also measures, on its own, how well a model explains the data. It is useful when one is looking for the model with the highest explanatory power rather than the optimal model.
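The penalty is easy to see numerically. Below is a hedged sketch (the `adjusted_r_squared` helper and the simulated data are assumptions for illustration) in which a pure-noise predictor raises $R^{2}$ slightly but tends to lower $R^{2}_{a}$.

```python
import numpy as np

def adjusted_r_squared(y, y_hat, p):
    """R^2_a = 1 - (SSE / (n - p - 1)) / (SST / (n - 1)); p = number of predictors."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                     # predictor unrelated to y
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

for X, p in [(np.column_stack([np.ones(n), x1]), 1),
             (np.column_stack([np.ones(n), x1, junk]), 2)]:
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(p, adjusted_r_squared(y, X @ beta, p))
# The junk variable barely reduces SSE, so the p = 2 model's R^2_a is typically lower.
```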
Akaike Information Criterion $\text{AIC}_{p}$
For $p$ independent variables, the Akaike Information Criterion is calculated by $\displaystyle \text{AIC}_{p} := n \ln \left( {{ \text{SSE}_{p} } \over {n}} \right) + 2(p+1)$. AIC is the most favored criterion in actual analysis; the model with the lower AIC is judged to be better.
The second term of the formula applies a penalty that grows as $p$, the number of variables, increases. A downside of AIC is that comparisons become inaccurate when the sample size $n$ differs between models. It might seem odd that $n$ could differ when the same data are analyzed with only the variables changed, but this can be a critical issue if certain variables have many missing values: observations with missing values are typically dropped, so models that use those variables end up fitted on fewer rows.
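A direct translation of the formula into Python might look like this; `aic` is a hypothetical helper, not a library function.

```python
import numpy as np

def aic(y, y_hat, p):
    """AIC_p = n * ln(SSE_p / n) + 2 * (p + 1); the lower, the better."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    return n * np.log(sse / n) + 2 * (p + 1)
```

Because the first term is scaled by $n$, AIC values computed on samples of different sizes are not on the same scale, which is why the missing-value issue above matters.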
Bayes Information Criterion $\text{BIC}_{p}$
For $p$ independent variables, the Bayes Information Criterion is calculated by $\displaystyle \text{BIC}_{p} := n \ln \left( {{ \text{SSE}_{p} } \over {n}} \right) + ( p + 1 ) \ln n$. It is similar to AIC but with the last term modified: the penalty scales with $\ln n$ rather than the constant $2$, so it complements AIC, and likewise the model with the lower value is considered better.
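The corresponding sketch only changes the penalty term; as before, `bic` is an assumed helper name.

```python
import numpy as np

def bic(y, y_hat, p):
    """BIC_p = n * ln(SSE_p / n) + (p + 1) * ln(n); the lower, the better."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    return n * np.log(sse / n) + (p + 1) * np.log(n)
```

Since $\ln n > 2$ whenever $n > e^{2} \approx 7.4$, BIC penalizes extra variables more heavily than AIC for all but the smallest samples, and thus tends to select sparser models.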
Mallows $C_{p}$
For $p$ independent variables, Mallows $C_{p}$ is calculated by $\displaystyle C_{p} := {{ \text{SSE}_{p} } \over { \hat{\sigma}^2 }} + ( 2p - n )$, where $\hat{\sigma}^{2}$ is an estimate of the error variance, commonly the mean squared error of the full model.
$C_{p}$ steers variable selection toward models with less bias, and a model is considered better the closer its value is to $C_{p} \approx p$, which indicates little bias. It is useful and mathematically neat for analyses that must be cautious about bias, but techniques that accept some bias in exchange for a large reduction in variance, fine-tuning the fit, have recently gained popularity, so $C_{p}$ is used less often.
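For completeness, here is a minimal sketch of the calculation, assuming $\hat{\sigma}^{2}$ has already been estimated (e.g. from the full model); `mallows_cp` and the example numbers are hypothetical.

```python
def mallows_cp(sse_p, sigma2_hat, n, p):
    """C_p = SSE_p / sigma2_hat + (2p - n); values near p suggest little bias."""
    return sse_p / sigma2_hat + (2 * p - n)

# e.g. a candidate model with p = 3 variables on n = 50 observations:
print(mallows_cp(sse_p=47.0, sigma2_hat=1.0, n=50, p=3))  # 3.0 = p, little bias
```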