The foundation of this website is that once you can calculate or create something manually, it's hard not to understand it. The ROC curve has been one of the easier concepts to grasp. As the title indicates, the math is fairly simple.
To see how the ROC curve is created, let's look at the output from the logistic model. To the left are the 20 observations from the Logistic Model, along with the predicted probability that each customer made a purchase; i.e., a Response of 0 is a customer that did not make a purchase and a Response of 1 is a customer that did.
The Probability of Y = 1 indicates the likelihood that a customer is predicted to make a purchase, so the closer that number is to 1, the greater the likelihood, based on the model, that the customer will make a purchase. An important concept to get here is that we'll have to select Cut-off points: predictions above the cut-off are rounded up to 1 and assumed to be purchases, while predictions below the cut-off are rounded down to 0 and assumed to be non-purchasing customers.
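To make the cut-off idea concrete, here's a minimal R sketch (the probabilities are invented for illustration, not the model's actual output):

```r
# Turn predicted probabilities into 0/1 classes at a chosen cut-off.
prob <- c(0.12, 0.45, 0.67, 0.81, 0.33)  # made-up illustrative values

cutoff <- 0.5
predicted_class <- as.integer(prob >= cutoff)  # 1 = predicted purchase
predicted_class
#> [1] 0 0 1 1 0
```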
As noted before, R Squared, as used to judge a Linear Regression model, would be useless for classification or binary models; subtracting a purchase likelihood between 0 and 100% from an indicator variable of either 0 or 1 would not be a good way to understand how predictive the model is.
The number of algorithms available in data science is one element that makes the process so powerful. Different data sets will respond differently to different algorithms. Of course, that raises its own problem: how to judge the effectiveness of various algorithms and select a finalist. With linear regression, you probably learned about R Squared, which summarizes the variation between the predicted value and actual value in the regression model. However, many data science algorithms are used to predict binary outcomes, and there R Squared doesn't make sense.
If you look at the Logistic, Decision Tree and Naive Bayes models on this site, they predict whether customers will or won't purchase an item, a binary outcome. In this context, R Squared is meaningless. The algorithms predict the likelihood that an event occurred, and the only comparison that could be made would be to the 0 or 1 indicating whether that event did or did not occur. That calculation is not very meaningful, so other approaches have been adopted. ROC curves, short for Receiver Operating Characteristic curves (a name inherited from radar signal analysis), are one of those methods. Below are ROC curves for the Logistic, Naive Bayes and Decision Tree models from other parts of this website, as created by R; however, this post will demonstrate how to create these curves manually.
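For reference, one common way to draw these curves in R is the pROC package; the sketch below is an assumption about the approach (the exact code behind the charts isn't shown here), using simulated stand-in data:

```r
library(pROC)

set.seed(1)
actual        <- rbinom(20, 1, 0.45)  # simulated 0/1 purchase indicator
prob_logistic <- runif(20)            # stand-in predicted probabilities
prob_nb       <- runif(20)

roc_logistic <- roc(actual, prob_logistic)
roc_nb       <- roc(actual, prob_nb)

plot(roc_logistic, legacy.axes = TRUE)  # x-axis drawn as 1 - Specificity
lines(roc_nb, col = "red", lty = 2)     # overlay a second model
```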
What the confusion matrix summarizes is shown to the left. In the context of our model, we are trying to predict purchases, which will be a value of 1 in our data set. The two green boxes represent good predictions. The upper left quadrant holds the customers correctly predicted NOT to make a purchase: if the target variable in our model is a 0 and the model predicts a probability below a certain point, .5 for instance, that is a True Negative, a good prediction that the person did not buy. In the lower right quadrant are the customers that made a purchase and were correctly predicted to make a purchase, the True Positives.
The upper right and lower left, the "false" quadrants, are where the model does not make a good prediction. What is important to note in the two tables above is that as the cut-off is adjusted, items will move between correct and incorrect categories. The ROC curve captures this concept!
To the right is the confusion matrix assuming that any prediction above .7 is a purchase. Notice how this has shifted values: the upper right quadrant has dropped from 3 down to 1, and the lower right from 6 down to 5. Our model now has fewer True Positives but also fewer False Positives.
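In R, the same two tables can be produced with the base table() function. The probability vector below is invented purely so that the quadrant counts match the two tables in the text; it is not the actual model output:

```r
actual <- c(rep(0, 10), rep(1, 10))  # 0 = no purchase, 1 = purchase
prob   <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.55, 0.67358, 0.75,
            0.20, 0.30, 0.35, 0.45, 0.60, 0.72, 0.80, 0.85, 0.90, 0.95)

confusion <- function(actual, prob, cutoff) {
  table(Actual = actual, Predicted = as.integer(prob >= cutoff))
}

confusion(actual, prob, 0.5)  # TN = 7, FP = 3, FN = 4, TP = 6
confusion(actual, prob, 0.7)  # TN = 9, FP = 1, FN = 5, TP = 5
```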
Finally, here's a ROC Curve for all of the binary models that have been built so far. Also see the file below for more detail.
Two additional concepts are needed in order to plot the ROC Curves.
Sensitivity: This is the True Positive rate, or the True Positive quadrant divided by all of the actual positives: True Positive/(True Positive + False Negative). In the context of the two tables shown, this will become the True Positive rate of our ROC curve.
For .5 cutoff: 6/(6+4) = 60.0%
For .7 cutoff: 5/(5+5) = 50.0%
Specificity: This is the True Negative rate, or the True Negative quadrant divided by all of the actual negatives: True Negative/(True Negative + False Positive). For the two tables above:
For .5 cutoff: 7/(7+3) = 70.0%
For .7 cutoff: 9/(9+1) = 90.0%
This gives the True Negative rate for the model; however, a ROC curve plots the True Positive rate against the False Positive rate, and the False Positive rate is 1 - Specificity. For the .5 cutoff that's 1 - 70% = 30%, and for the .7 cutoff, 1 - 90% = 10%.
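Worked in R from the quadrant counts above:

```r
tpr <- function(tp, fn) tp / (tp + fn)  # Sensitivity / True Positive rate
tnr <- function(tn, fp) tn / (tn + fp)  # Specificity / True Negative rate

# .5 cut-off: TP = 6, FN = 4, TN = 7, FP = 3
tpr(6, 4)      # 0.6 -> a point's y-coordinate on the ROC curve
1 - tnr(7, 3)  # 0.3 -> its x-coordinate (False Positive rate)

# .7 cut-off: TP = 5, FN = 5, TN = 9, FP = 1
tpr(5, 5)      # 0.5
1 - tnr(9, 1)  # 0.1
```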
The ROC curve, by contrast, is both a good way to visualize the model result and a good way to compare results across models. Ironically, there is no complex math, really no math at all, involved in building a ROC curve!
To create a ROC Curve, a couple of concepts are needed. The first is the Confusion Matrix to the left. The confusion matrix (which will also be covered in more detail at some point on this site) is a way of summarizing correct and incorrect predictions. The confusion matrix to the left is for the Logistic Model above at a cut-off point of .5, meaning that in the table just above it, anything with a score of .5 or above is assumed to be a purchase and anything below is assumed not to be a purchase. Looking at the table above, focus on item 6 (Index 6): this is a non-purchase, as the Response is 0, but the probability of a purchase is predicted at .67358. This will be a False Positive. Imagine, though, that the cut-off point is increased to .7 (so predictions below .7 are rounded to 0). With this change, the False Positive swings to a True Negative: the customer did not purchase, the value of .67358 is rounded down to 0, and the customer is predicted not to purchase.
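That swing can be seen in two lines of R, using the prediction for Index 6:

```r
p <- 0.67358          # predicted probability for Index 6 (Response = 0)
as.integer(p >= 0.5)  # 1: predicted purchase, actual 0 -> False Positive
as.integer(p >= 0.7)  # 0: predicted non-purchase, actual 0 -> True Negative
```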
The final step is to calculate the Sensitivity and Specificity for each of the tables to the right. These calculations are demonstrated in the file below, so they will not be repeated here. However, the output of each is below.
Make sure you understand the Cut-Off point as discussed above. Again, think of this as a rounding point where a predicted probability is rounded up or down. For this step, I ordered the probability predictions and used the first, the last, and the average of every two predictions in between in order to reduce the number of Cut-Off points down to 10.
Using a pivot table, it's relatively easy to create a confusion matrix that adjusts based on the Cut-Off points. (This is in the attached file in case you want to see how that works.)
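The same mechanics translate directly to R. The sketch below reuses the illustrative actual and prob vectors from the confusion-matrix example above; it keeps every midpoint between consecutive sorted predictions rather than thinning them down to 10 as described:

```r
# Cut-off candidates: the smallest and largest predictions, plus the
# midpoint between each pair of consecutive sorted predictions.
p_sorted <- sort(prob)
cutoffs  <- c(p_sorted[1],
              (head(p_sorted, -1) + tail(p_sorted, -1)) / 2,
              p_sorted[length(p_sorted)])

# True Positive rate and False Positive rate at each cut-off.
roc_points <- t(sapply(cutoffs, function(k) {
  predicted <- as.integer(prob >= k)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  c(FPR = fp / (fp + tn), TPR = tp / (tp + fn))
}))

plot(roc_points[, "FPR"], roc_points[, "TPR"], type = "b",
     xlab = "False Positive rate (1 - Specificity)",
     ylab = "True Positive rate (Sensitivity)")
abline(0, 1, lty = 2)  # the 45-degree "random model" line
```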
A ROC curve is interpreted by looking at how far toward the upper left the curve bows. In the chart to the left, a 45-degree line is drawn from the lower left to the upper right. This represents the result of a random model, one no better than guessing; i.e., a curve that lies along this line would not be a good model. A curve that bows below the diagonal line represents a model that performs worse than a randomly generated guess, definitely not a good model.
The ROC curve allows for easy visual comparison among a number of models; in this case, the Naive Bayes model appears to be the best performing of the models applied to date (much of it overlaps with the Logistic Model, but there is a bit of the red dotted line appearing above the Logistic Model). In a future post, I'll explore other measures such as AUC, the Area Under the Curve, which is an even easier way to make absolute comparisons between models.