Finding the optimal cutoff using ROC curves
Overview
Drawing the ROC curve is useful because it gives a visual representation of how well a model derived from the training data explains the test data. However, since the curve computes the classification rate at every possible cutoff, it cannot by itself tell us which cutoff to use for classifying 0 and 1. To address this, let's borrow the idea behind cross-validation and hold out a separate validation set.
Validation Data
A cutoff chosen to best classify 0 and 1 on the training data is, by definition, just the cutoff that best explains the training data. Likewise, if you calculate the classification rate on the test data, the resulting optimal cutoff will only be guaranteed to explain the test data well. Thus, we create separate validation data and use it to find a cutoff that is not biased toward any particular dataset.
- Step 1. Train a model using the training data.
- Step 2. Apply the model to the validation data to find the optimal cutoff.
- Step 3. Using the model obtained from the training data and the cutoff obtained from the validation data, classify the test data and observe the performance. If multiple model candidates exist, select the model with the highest performance.
The model obtained in Step 1 becomes a proper model only after incorporating the cutoff obtained in Step 2.
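To make this concrete, the final classifier is nothing more than the fitted model paired with a cutoff. Below is a minimal sketch in R; the function name classify and the 0/1 labels are illustrative, not from the original post.
classify <- function(model, newdata, cutoff) {
  p <- predict(model, newdata=newdata, type="response")  # predicted probabilities
  ifelse(p > cutoff, 1, 0)                               # 1 above the cutoff, 0 otherwise
}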
Practice
(Continuing from Drawing the ROC Curve)
The figure above is the ROC curve obtained in the previous post, where the data were divided into training data and test data only. Now, to find the optimal cutoff, the data must be split three ways. The appropriate ratio differs from dataset to dataset, but if there are no significant issues, a 3:1:1 ratio is generally acceptable.
DATANUM<-nrow(DATA)                     # total number of observations
DATANUM*c(0.6,0.2,0.2)                  # expected sizes of the 3:1:1 split
slicing<-sample(1:DATANUM)              # random permutation of the row indices
slicing[slicing>(DATANUM*0.8)]<-(-1)    # top 20%  -> test (becomes 3 below)
slicing[slicing>(DATANUM*0.6)]<-(-2)    # next 20% -> validation (becomes 2)
slicing[slicing>0]<-(-3)                # remaining 60% -> training (becomes 1)
slicing<-slicing+4                      # recode: 1 = train, 2 = valid, 3 = test
train<-DATA[slicing==1,]; head(train)
valid<-DATA[slicing==2,]; head(valid)
test<-DATA[slicing==3,]; head(test)
After preprocessing the data and running the above code, the data will be split into three parts: training, validation, and test data.
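As a quick sanity check (optional, not in the original code), the group sizes can be verified against the intended 3:1:1 ratio using the slicing vector defined above:
table(slicing)                                   # 1 = train, 2 = valid, 3 = test
c(nrow(train), nrow(valid), nrow(test))/DATANUM  # should be roughly 0.6, 0.2, 0.2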
Step 1.
Train a model using the training data.
out0<-glm(vote~.,family=binomial(),data=train); summary(out0)  # full logistic regression model
vif(out0)                                                      # multicollinearity check (car package)
out1<-step(out0, direction="both"); summary(out1)              # stepwise variable selection
qchisq(0.95,df=1454)                                           # critical value for the deviance goodness-of-fit check
vif(out1)                                                      # multicollinearity check for the selected model
After running the above code, a model is built through stepwise variable selection, and checks for goodness of fit and multicollinearity are performed. There are no particular issues with the model.
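For reference, the qchisq(0.95, df=1454) line above gives the critical value for a deviance goodness-of-fit check; here is a sketch making the comparison explicit, assuming 1454 is the residual degrees of freedom of out1 in this run:
deviance(out1)                                       # residual deviance of the selected model
deviance(out1) < qchisq(0.95, df=df.residual(out1))  # TRUE suggests no evidence of lack of fit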
Step 2.
Apply the model to the validation data to find the optimal cutoff.
p <- predict(out1, newdata=valid, type="response")          # predicted probabilities for the validation data
pr <- prediction(p, valid$vote)                             # ROCR prediction object
prf <- performance(pr, measure = "tpr", x.measure = "fpr")  # TPR vs FPR over all cutoffs
win.graph(); plot(prf, main='ROC of Validation Data')       # win.graph() is Windows-only; use x11() or quartz() on other systems
After running the above code, the validation data is treated like test data, and the ROC curve is drawn as shown below.
The optimal cutoff can vary depending on the data and the objective, but if there is no specific point of interest, the point on the curve closest to $(0,1)$, the upper-left corner, is chosen as the optimal cutoff. Since this requires distance calculations, the code is somewhat involved.
# Squared distance from every point on the curve to (0,1); [-1] drops the first point (cutoff = Inf)
optid<-(1:length(prf@y.values[[1]][-1]))[((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)
==min((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)]
points(prf@x.values[[1]][-1][optid],prf@y.values[[1]][-1][optid], col='red', pch=15)  # mark the closest point in red
optcut<-prf@alpha.values[[1]][-1][optid]; optcut                                      # cutoff at that point
Running the code above marks the point closest to $(0,1)$ in red and prints the cutoff at that point.
Although the code is lengthy, it is not necessary to understand it in detail; complexity and difficulty are not the same thing. The code above is long but conceptually straightforward: it measures the distance from every point on the curve to $(0,1)$ and selects the closest one. Looking up prf@alpha.values at that point then gives the cutoff. This cutoff can be accepted as the most appropriate one for classifying the data. (To reiterate, this is not an absolute measure; the 'optimal cutoff' can be completely redefined according to the user's objective.)
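For example, one common alternative criterion (not the one used in this post) is Youden's J statistic, which picks the cutoff maximizing TPR - FPR; a sketch reusing the same prf object:
j <- prf@y.values[[1]][-1] - prf@x.values[[1]][-1]   # Youden's J = TPR - FPR at each cutoff
optcut_j <- prf@alpha.values[[1]][-1][which.max(j)]; optcut_j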
The optimal cutoff obtained in this example is $0.4564142$, where values higher than this are classified as 1 and those lower as 0. (For the third time, this is an acceptable interpretation, but not necessarily the best one. It is up to the analyst to provide a valid interpretation.)
Step 3.
Check how well it performs on the test data.
p <- predict(out1, newdata=test, type="response"); head(p,48)  # predicted probabilities for the test data
table(test$vote, p>optcut)                                     # confusion matrix at the optimal cutoff
Running the code above will calculate the probabilities for the test data and print the confusion matrix according to the optimal cutoff.
The accuracy from the confusion matrix above is approximately $81\%$, which is quite reasonable and, if satisfactory to the analyst, can be accepted as the final model. You may have noticed that, strictly speaking, it is not necessary to draw the ROC curve at all to find the optimal cutoff: the quantities needed for the computation are already available as numeric vectors inside the prf object, so simply running the code to obtain the values is perfectly acceptable.
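For reference, the accuracy mentioned above can be computed directly from the confusion matrix; a short sketch reusing p and optcut:
cm <- table(test$vote, p>optcut)  # confusion matrix
sum(diag(cm))/sum(cm)             # accuracy: correct classifications / total (about 0.81 in this run)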
Code
Below is the example code.
install.packages("car")
install.packages("ResourceSelection")
install.packages("ROCR")
library(car)
library(ResourceSelection)
library(ROCR)
set.seed(150421)
?Chile
str(Chile)
nrow(Chile)
DATA<-na.omit(Chile)
DATA$vote[DATA$vote!='Y']<-'N'
DATA$vote<-factor(DATA$vote)
DATANUM<-nrow(DATA)
DATANUM*c(0.6,0.2,0.2)
slicing<-sample(1:DATANUM)
slicing[slicing>(DATANUM*0.8)]<-(-1)
slicing[slicing>(DATANUM*0.6)]<-(-2)
slicing[slicing>0]<-(-3)
slicing<-slicing+4
train<-DATA[slicing==1,]; head(train)
valid<-DATA[slicing==2,]; head(valid)
test<-DATA[slicing==3,]; head(test)
out0<-glm(vote~.,family=binomial(),data=train); summary(out0)
vif(out0)
out1<-step(out0, direction="both"); summary(out1)
qchisq(0.95,df=1454)
vif(out1)
p <- predict(out1, newdata=valid, type="response")
pr <- prediction(p, valid$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Validation Data')
optid<-(1:length(prf@y.values[[1]][-1]))[((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)
==min((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)]
points(prf@x.values[[1]][-1][optid],prf@y.values[[1]][-1][optid], col='red', pch=15)
optcut<-prf@alpha.values[[1]][-1][optid]; optcut
p <- predict(out1, newdata=test, type="response"); head(p,48)
table(test$vote, p>optcut)