

Finding the optimal cutoff using ROC curves

Overview

Drawing an ROC curve is useful because it gives quick visual insight into how well a model developed on training data explains the test data. However, since the curve is drawn by computing and connecting classification rates for every possible cutoff, it does not by itself tell us which cutoff to use to classify 0 and 1. To find that out, let's apply the methodology of cross-validation with a separate validation set.


Validation Data

Even if we find the optimal cutoff that best separates 0 and 1 in the training data, it is merely the cutoff that explains the training data well. Likewise, if we instead compute classification rates on the test data to pick a cutoff, that cutoff only explains the test data well. Therefore, we set aside separate validation data and use it to find a cutoff that is not biased toward any one data set.

[Figure: splitting the data into training, validation, and test sets]

  • Step 1.
    Train the model with the training data.
  • Step 2.
    Apply it to the validation data to find the optimal cutoff.
  • Step 3.
    Test the model obtained from the training data, together with the cutoff obtained from the validation data, on the test data to see how high the performance is. If there are several candidate models, select the one with the highest performance as the final model.

The model obtained in Step 1 only becomes a proper model when it includes the cutoff obtained in Step 2.

Practice

(This continues from the post on drawing an ROC curve.)

[Figure: ROC curve from the train/test split in the previous post]

The figure above shows the ROC curve from the previous post, where the data was divided only into training and test data. Now, to identify the optimal cutoff, the data needs to be split into three parts. The ratio for this split varies with the data, but 3:1:1 is usually suitable if there are no particular issues.

DATANUM<-nrow(DATA)
DATANUM*c(0.6,0.2,0.2)                  # target sizes for the 3:1:1 split
  
slicing<-sample(1:DATANUM)              # random permutation of the row indices
slicing[slicing>(DATANUM*0.8)]<-(-1)    # top 20%  -> test
slicing[slicing>(DATANUM*0.6)]<-(-2)    # next 20% -> validation
slicing[slicing>0]<-(-3)                # remaining 60% -> training
slicing<-slicing+4                      # recode -3,-2,-1 to 1,2,3
 
train<-DATA[slicing==1,]; head(train)
valid<-DATA[slicing==2,]; head(valid)
test<-DATA[slicing==3,]; head(test)

After preprocessing the data and running the code above, the data is split into three sets: training, validation, and test data.

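As a quick sanity check, which is a small addition and not part of the original code, the sizes of the three pieces can be compared with the intended 3:1:1 ratio:

# the three pieces should be roughly in a 3:1:1 ratio
sapply(list(train=train, valid=valid, test=test), nrow)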

Step 1.

Train the model with the training data.

out0<-glm(vote~.,family=binomial(),data=train); summary(out0)   # full logistic regression model
vif(out0)                                                       # check multicollinearity
 
out1<-step(out0, direction="both"); summary(out1)               # stepwise variable selection
qchisq(0.95,df=1454)                                            # chi-square quantile to compare with the residual deviance
vif(out1)

Running the code above performs the variable-selection procedure, builds a model, and checks goodness of fit and multicollinearity.

There seems to be nothing particularly problematic with the model.
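Since the example code loads ResourceSelection, presumably for the Hosmer-Lemeshow test, a goodness-of-fit check along the following lines is one option; this is a sketch under that assumption, not output from the post:

# Hosmer-Lemeshow goodness-of-fit test; a large p-value means no evidence of lack of fit
# as.numeric(train$vote)-1 recodes the factor levels N/Y to 0/1
hoslem.test(as.numeric(train$vote)-1, fitted(out1), g=10)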

Step 2.

Apply it to the validation data to find the optimal cutoff.

p <- predict(out1, newdata=valid, type="response")   # predicted probabilities on the validation data
 
pr <- prediction(p, valid$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Validation Data')   # win.graph() is Windows-only; use dev.new() elsewhere

Running the code above treats the validation data as if it were test data and draws the following ROC curve.

[Figure: ROC curve on the validation data]

The optimal cutoff may be chosen differently depending on the data and the objective, but if there is no particular priority, we take the point on the curve closest to the upper-left corner $(0,1)$ and use that point's cutoff as the optimal cutoff. Because a distance has to be computed, the code is somewhat involved.

optid<-(1:length(prf@y.values[[1]][-1]))[((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)
                                         ==min((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)]
points(prf@x.values[[1]][-1][optid],prf@y.values[[1]][-1][optid], col='red', pch=15)
optcut<-prf@alpha.values[[1]][-1][optid]; optcut

Running the code above marks the point closest to $(0,1)$ in red, as described, and prints the cutoff at that point.

[Figure: ROC curve with the point closest to $(0,1)$ marked in red, and the cutoff printed to the console]

The code looks convoluted, but there is no need to struggle to understand it. Being complicated and being difficult are different things. The code above is merely long; conceptually it is not hard at all. It simply measures the distance between every point on the curve and $(0,1)$, then picks the point where that distance is shortest. The cutoff at that point can be found by referring to alpha.values. This cutoff can be accepted as the most appropriate one for classifying the data properly. (To emphasize once again, this is not an absolute measure; 'the optimal cutoff' itself could be redefined entirely according to the user's objective.)
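For reference, the same search can be written more compactly with which.min(); the following is an equivalent sketch, not the original code of the post:

fpr   <- prf@x.values[[1]][-1]          # false positive rates along the curve
tpr   <- prf@y.values[[1]][-1]          # true positive rates along the curve
alpha <- prf@alpha.values[[1]][-1]      # cutoff corresponding to each point
 
optid <- which.min(fpr^2 + (1-tpr)^2)   # index of the point closest to (0,1)
alpha[optid]                            # the optimal cutoff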

The optimal cutoff obtained in this example is $0.4564142$; it is acceptable to classify anything above it as 1 and anything below it as 0. (To emphasize a third time, this should be taken as a good choice, not the best one. Providing a valid interpretation is entirely up to the analyst.)
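In code, that classification rule is just a threshold comparison; the line below is illustrative and not part of the original:

head(ifelse(p > optcut, "Y", "N"))   # label as 'Y' above the cutoff, 'N' below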

Step 3.

Check how well it fits with the test data.

p <- predict(out1, newdata=test, type="response"); head(p,48)
table(test$vote, p>optcut)

Running the code above computes the probabilities on the test data and prints the confusion matrix at the optimal cutoff.

[Figure: confusion matrix on the test data]

The correct classification rate of the confusion matrix above is about $81 \%$, which is quite usable; if the analyst is satisfied with it, the model is worth accepting as the final model.

As you may have noticed, strictly speaking there is no need to draw the ROC curve at all to find the optimal cutoff. Since all the data needed for the calculation is already available as a data frame, it is perfectly fine to just run the code and obtain the value.
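For reference, the correct classification rate can also be computed directly from the confusion matrix; a small sketch, not in the original code:

tab <- table(test$vote, p>optcut)   # confusion matrix at the optimal cutoff
sum(diag(tab))/sum(tab)             # correct classification rate (about 0.81 here)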

Code

Below is the full example code.

install.packages("car")
install.packages("ResourceSelection")
install.packages("ROCR")
 
library(car)
library(ResourceSelection)
library(ROCR)
 
set.seed(150421)
 
?Chile
str(Chile)
nrow(Chile)
 
DATA<-na.omit(Chile)
DATA$vote[DATA$vote!='Y']<-'N'
DATA$vote<-factor(DATA$vote)
 
DATANUM<-nrow(DATA)
DATANUM*c(0.6,0.2,0.2)
  
slicing<-sample(1:DATANUM)
slicing[slicing>(DATANUM*0.8)]<-(-1)
slicing[slicing>(DATANUM*0.6)]<-(-2)
slicing[slicing>0]<-(-3)
slicing<-slicing+4
 
train<-DATA[slicing==1,]; head(train)
valid<-DATA[slicing==2,]; head(valid)
test<-DATA[slicing==3,]; head(test)
 
out0<-glm(vote~.,family=binomial(),data=train); summary(out0)
vif(out0)
 
out1<-step(out0, direction="both"); summary(out1)
qchisq(0.95,df=1454)
vif(out1)
 
p <- predict(out1, newdata=valid, type="response")
 
pr <- prediction(p, valid$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Validation Data')
 
optid<-(1:length(prf@y.values[[1]][-1]))[((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)
                                         ==min((prf@x.values[[1]][-1])^2 + (1-prf@y.values[[1]][-1])^2)]
points(prf@x.values[[1]][-1][optid],prf@y.values[[1]][-1][optid], col='red', pch=15)
optcut<-prf@alpha.values[[1]][-1][optid]; optcut
 
p <- predict(out1, newdata=test, type="response"); head(p,48)
table(test$vote, p>optcut)