
Drawing ROC Curves in R

Definition

The plot whose axes are the False Positive Rate and the True Positive Rate computed from a confusion matrix is called the ROC curve (Receiver Operating Characteristic curve).
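In terms of the four counts of the confusion matrix (true positives $TP$, false positives $FP$, true negatives $TN$, false negatives $FN$), the two rates are the standard ones:

$$ \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN} $$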

Explanation

An ROC curve not only gives a quick look at a model’s performance but is also useful for finding an optimal cutoff, comparing models, and more. Let’s draw an ROC curve in R and work through its meaning with an example. The key package used is ROCR.

Practice

Let’s load the built-in Chile data from the car package.

20190128_132747.png

Chile contains survey data from the 1988 Chilean plebiscite on whether to extend Augusto Pinochet’s rule, recording vote intention along with region, population (the population of the respondent’s community), sex, age, education, income, and statusquo (a scale of support for the status quo).

20190128_133740.png

The vote variable has four classes: A (Abstain), U (Undecided), N (No), and Y (Yes), but for logistic regression every class except Y was recoded to N. In this post, Y is referred to as ‘Yes’ and N as ‘No’.
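In code, the recoding looks like this (the same lines appear in the full listing at the end; na.omit() first drops rows with missing values):

DATA <- na.omit(Chile)                 # drop rows with missing values
DATA$vote[DATA$vote != 'Y'] <- 'N'     # collapse A, U, N into 'N'
DATA$vote <- factor(DATA$vote)         # keep only the two levels N, Y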

20190128_152957.png

To check the model’s performance, the data were split into training and test sets.
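The split itself is only a few lines (again from the full listing below); sample() shuffles the row indices so that roughly $80\%$ of the rows land in the training set:

DATANUM <- nrow(DATA)
train <- sample(1:DATANUM) < (DATANUM * 0.8)   # logical mask: ~80% TRUE
test  <- DATA[!train, ]
train <- DATA[train, ]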

20190128_153114.png

The logistic regression model fitted on the training data is shown above. Now let’s feed the test data into this model to compute each voter’s probability of voting ‘Yes’. Note that you must pass type="response" to predict() to get the result in the form of a probability.

out0<-glm(vote~.,family=binomial(),data=train); summary(out0)   # fit the logistic regression
p <- predict(out0, newdata=test, type="response"); p            # predicted P(vote = 'Y') on the test set

The output is as follows.

20190128_152826.png

The question is how to interpret and use these probabilities. What we really wanted was not a probability, but whether a person votes ‘Yes’ or not. So we predict ‘Yes’ for probabilities above a certain threshold and ‘No’ for those below it. This threshold is called the cutoff or threshold; ‘cutoff’ is the more common term.
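For instance, with a provisional cutoff of 0.5, the predicted classes and the resulting confusion matrix can be tabulated like this (a minimal sketch reusing p and test from above; the names yhat, predicted, and actual are just illustrative):

yhat <- ifelse(p > 0.5, 'Y', 'N')             # classify by the 0.5 cutoff
table(predicted = yhat, actual = test$vote)   # confusion matrix on the test set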

Optimal Cutoff

The classification rate depends on this cutoff. For example, even if we predict everyone below 0.99 as ‘No’, the classification rate will still look decent, because it gets all the actual ‘No’ votes right. Likewise, predicting everyone above 0.01 as ‘Yes’ also yields some correct predictions. By analogy, in a country where it rains 120 days a year, predicting the next day’s weather as ‘sunny’ every single day still achieves an accuracy of about $67\%$.

We need better analysis and prediction than that, which is why we look for a good cutoff. It might seem natural to simply use 0.5, predicting ‘Yes’ above it and ‘No’ below, but in many cases this doesn’t work well. (And the classification rate is not always the only criterion; depending on the situation, other measures may matter more.)

So, to find a good cutoff, we try every value of the computed probabilities p as a cutoff. We build a confusion matrix at a cutoff of 0.89572040 and compute the False Positive Rate and True Positive Rate, then do the same at 0.81810785, at 0.70215561, and so on… Repeating this and plotting the False Positive Rate on the $x$-axis against the True Positive Rate on the $y$-axis gives the ROC curve.
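To make that repetition concrete, here is a rough hand-rolled version of the computation (a sketch only; the names cutoffs and roc are mine, and the ROCR code below is what the post actually uses):

cutoffs <- sort(unique(p), decreasing = TRUE)   # every predicted probability as a cutoff
roc <- t(sapply(cutoffs, function(cutoff) {
  pred_yes   <- p >= cutoff
  actual_yes <- test$vote == 'Y'
  c(fpr = sum(pred_yes & !actual_yes) / sum(!actual_yes),   # False Positive Rate
    tpr = sum(pred_yes &  actual_yes) / sum( actual_yes))   # True Positive Rate
}))
head(roc)   # one (FPR, TPR) point per cutoff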

It might sound complicated, but don’t worry, it’s simple in code. To build the confusion matrices and compute the needed figures, run the following code. Since we don’t need to inspect the raw output the way we did with the probabilities, I’ll only briefly explain the role of each function. prediction() takes the computed probabilities p and the actual labels test$vote from the test data and pairs them up for evaluation. performance() extracts the needed figures from the resulting object pr and returns data that, when passed to plot(), draws the ROC curve.

pr <- prediction(p, test$vote)                               # pair predicted probabilities with true labels
prf <- performance(pr, measure = "tpr", x.measure = "fpr")   # TPR vs FPR at every cutoff
win.graph(); plot(prf, main='ROC of Test Data')              # win.graph() opens a new device on Windows

Running the code above draws the following ROC curve.

ROC.png

The curve above is obtained by computing the FPR and TPR for every cutoff and plotting them as the $x$- and $y$-coordinates, respectively. The larger the area under the curve, the better; in terms of its trajectory, you want the curve to hug the upper-left point $(0,1)$. The cutoff itself does not appear in the plot, but digging into the data structure shows it can be referenced through the alpha.values slot (prf@alpha.values).
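The area under the curve can be pulled from the same pr object; ROCR’s performance() also supports measure = "auc", with the value stored in the y.values slot:

auc <- performance(pr, measure = "auc")
auc@y.values[[1]]   # area under the ROC curve, between 0 and 1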

20190129_101311.png

Digging in with str() shows that at a cutoff of 0.939, the TPR is 0 and the FPR is 0.00556. An FPR of 0.00556 means that only about $0.5\%$ of the ‘No’ votes were predicted incorrectly. That looks good so far, but since the TPR is 0, not a single ‘Yes’ was predicted correctly. Intuitively, everyone whose probability of voting ‘Yes’ was 0.939 or below was predicted as ‘No’, so the hurdle for predicting ‘Yes’ was far too high. That is why every ‘No’ is correct while no ‘Yes’ can be.
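One common way to turn these slots into a single ‘optimal’ cutoff is Youden’s index, which maximizes TPR − FPR; note that this criterion is my addition here, not something the original analysis computed:

cutoffs <- prf@alpha.values[[1]]   # the cutoff behind each ROC point
fpr <- prf@x.values[[1]]
tpr <- prf@y.values[[1]]
best <- which.max(tpr - fpr)       # Youden's J statistic
cutoffs[best]                      # a candidate optimal cutoff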

My Plot Comes Out Angular

Some people find that the ROC curve they drew looks strangely square. That happens simply because the test data is small. Having little data may itself be a concern, but the angular shape is nothing to worry about. For example, shrinking the test data in this example down to 20 observations produces the following shape.

smallroc.png

Code

Below is the complete example code.

install.packages("car")
install.packages("ROCR")
 
library(car)
library(ROCR)
 
set.seed(150421)
 
?Chile
str(Chile)
nrow(Chile)
head(Chile); tail(Chile)
 
DATA<-na.omit(Chile)
DATA$vote[DATA$vote!='Y']<-'N'
DATA$vote<-factor(DATA$vote)
head(DATA); tail(DATA)
 
DATANUM<-nrow(DATA)
train<-sample(1:DATANUM)<(DATANUM*0.8)   # random logical mask: ~80% TRUE
test<-DATA[!train,]; head(test)
train<-DATA[train,]; head(train)
 
out0<-glm(vote~.,family=binomial(),data=train); summary(out0)
p <- predict(out0, newdata=test, type="response"); p
 
pr <- prediction(p, test$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Test Data')
str(prf)
 
smalltest<-test[sample(nrow(test),20),]
p <- predict(out0, newdata=smalltest, type="response"); p
pr <- prediction(p, smalltest$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Small Test Data')