Drawing ROC Curves in R
Definition
The ROC curve (Receiver Operating Characteristic curve) is the graph obtained by plotting the True Positive Rate (TPR) of a confusion matrix against the False Positive Rate (FPR) as the classification cutoff varies.
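For reference, writing TP, FP, TN, and FN for the true positives, false positives, true negatives, and false negatives in the confusion matrix, the two rates are $TPR = TP/(TP+FN)$, the proportion of actual positives that are predicted positive, and $FPR = FP/(FP+TN)$, the proportion of actual negatives that are predicted positive.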
Explanation
The ROC curve is extremely useful not only for giving a clear picture of a model’s performance but also for finding the optimal cutoff and for comparing different models. Let’s work through its meaning and draw an ROC curve in R using an example. The key package for this task is ROCR.
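Both ROCR and car (which provides the dataset used in the example below) can be installed and loaded as in the complete listing at the end of the post:
install.packages("car")
install.packages("ROCR")
library(car)
library(ROCR)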
Practical Example
Let’s load the Chile dataset from the car package.
The Chile dataset is a survey about the 1988 vote on extending the rule of Chilean dictator Augusto Pinochet, capturing variables like region, population, sex, age, education, income, statusquo (support for the current regime), and vote tendency.
The vote variable has four categories: A (abstain), U (undecided), N (no), and Y (yes). For logistic regression, we recode everything except Y to N, so throughout this post Y represents ‘yes’ and N represents ‘no’.
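A minimal sketch of this preprocessing, matching the complete listing at the end of the post (rows with missing values are dropped first):
DATA<-na.omit(Chile)                  # drop rows with missing values
DATA$vote[DATA$vote!='Y']<-'N'        # collapse A, U, and N into 'N'
DATA$vote<-factor(DATA$vote)          # re-level the factor to N/Y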
To evaluate the model’s performance, we split the data into training and test sets.
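The split used here keeps roughly 80% of the rows for training, again as in the complete listing:
set.seed(150421)                             # make the split reproducible
DATANUM<-nrow(DATA)
train<-sample(1:DATANUM)<(DATANUM*0.8)       # logical index marking about 80% of rows at random
test<-DATA[!train,]
train<-DATA[train,]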
The following code fits the logistic regression model on the training data and prints its summary. We then use the model to calculate, for each voter in the test data, the probability of voting yes: the predict() function with the newdata=test option lets the model compute these probabilities, and type="response" must be included so that they are returned on the probability scale.
out0<-glm(vote~.,family=binomial(),data=train); summary(out0)   # fit the model on the training data
p <- predict(out0, newdata=test, type="response"); p            # predicted probabilities of voting 'Y'
The output is a vector of predicted probabilities, one for each voter in the test set. The question is how to interpret and use these probabilities. What we actually want to determine is whether someone voted ‘yes’ or ‘no’. To do this, we predict ‘yes’ for probabilities above a certain value and ‘no’ for those below it. This value is called the cutoff (also known as the threshold); ‘cutoff’ is the more common term, so we use it from here on.
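As a quick illustration (this snippet is not part of the original code), a provisional cutoff of 0.5 turns the probabilities into class labels, from which a confusion matrix can be tabulated:
pred_class <- ifelse(p >= 0.5, 'Y', 'N')            # hypothetical cutoff of 0.5
table(predicted = pred_class, actual = test$vote)   # confusion matrix at that cutoff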
Optimal Cutoff
The accuracy will vary depending on where we set this cutoff, and extreme choices can still look deceptively good. Predicting ‘no’ for everything below a cutoff of 0.99 classifies nearly every observation as ‘no’, so the accuracy is roughly the proportion of actual ‘no’ votes; likewise, predicting ‘yes’ for anything above 0.01 classifies nearly everyone as ‘yes’. For instance, in a country where it rains 120 out of 365 days, always predicting ‘sunny’ still gives an accuracy of roughly $245/365 \approx 67\%$.
To achieve better analysis and predictions, we need to find a good cutoff. While 0.5 might seem like a straightforward threshold, this is often not optimal. Hence, to find the best cutoff, we test every possible probability value as a cutoff. By calculating the False Positive Rate and True Positive Rate for each cutoff, we can plot the ROC curve.
Though it sounds complicated, the code is straightforward. To build the confusion matrices and calculate the necessary metrics, run the following code. Unlike the probabilities, we don’t need to inspect its output, so we’ll just explain the role of each function. The prediction() function pairs the calculated probabilities with the actual test labels, test$vote, preparing them for evaluation. The performance() function then computes the True Positive Rate and False Positive Rate at every cutoff, which is exactly what we need to plot the ROC curve.
pr <- prediction(p, test$vote)                              # pair predicted probabilities with actual labels
prf <- performance(pr, measure = "tpr", x.measure = "fpr")  # TPR and FPR at every cutoff
win.graph(); plot(prf, main='ROC of Test Data')             # win.graph() opens a new plotting window (Windows only)
Running the code will produce the following ROC curve.
The curve plots the FPR and TPR for every possible cutoff, with the x-axis representing the FPR and the y-axis representing the TPR. While the cutoff values aren’t shown on the graph, they are stored in the alpha.values slot of prf and can be inspected with prf@alpha.values. Ideally, the curve should be close to the upper left corner $(0,1)$ and the area under the curve should be large.
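Although the original code stops at the plot, ROCR’s performance() function can also report the area under the curve directly; a minimal sketch:
auc <- performance(pr, measure = "auc")
auc@y.values[[1]]                        # area under the ROC curve, between 0 and 1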
Using the str() function on prf, we see that with a cutoff of 0.939 the TPR is 0 and the FPR is 0.00556. An FPR of 0.00556 means that only about $0.56\%$ of the actual ‘no’ voters are falsely predicted as ‘yes’. While this seems good, a TPR of 0 means not a single ‘yes’ was correctly predicted: requiring a predicted probability of at least 0.939 before calling a vote ‘yes’ sets the bar far too high.
Angular ROC Curves
Some users may find that their ROC curve looks too angular (step-like); this is simply a consequence of a small test set. A small test set can itself be a concern, but the angular shape of the ROC curve is not. For instance, reducing the test set to 20 observations results in the following shape.
Code
Below is the complete example code.
install.packages("car")
install.packages("ROCR")
library(car)
library(ROCR)
set.seed(150421)
?Chile
str(Chile)
nrow(Chile)
head(Chile); tail(Chile)
DATA<-na.omit(Chile)                  # drop rows with missing values
DATA$vote[DATA$vote!='Y']<-'N'        # collapse A, U, and N into 'N'
DATA$vote<-factor(DATA$vote)          # re-level the factor to N/Y
head(DATA); tail(DATA)
DATANUM<-nrow(DATA)
train<-sample(1:DATANUM)<(DATANUM*0.8)   # logical index marking about 80% of rows at random
test<-DATA[!train,]; head(test)
train<-DATA[train,]; head(train)
out0<-glm(vote~.,family=binomial(),data=train); summary(out0)
p <- predict(out0, newdata=test, type="response"); p
pr <- prediction(p, test$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Test Data')
str(prf)
smalltest<-test[sample(nrow(test),20),]
p <- predict(out0, newdata=smalltest, type="response"); p
pr <- prediction(p, smalltest$vote)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
win.graph(); plot(prf, main='ROC of Small Test Data')