Roc curve and cut off point. Python Roc curve and cut off point. Python python python

Roc curve and cut off point. Python


You can do this using the epi package in R, however I could not find similar package or example in Python.

The optimal cut off point would be where “true positive rate” is high and the “false positive rate” is low. Based on this logic, I have pulled an example below to find optimal threshold.

Python code:

import pandas as pdimport statsmodels.api as smimport pylab as plimport numpy as npfrom sklearn.metrics import roc_curve, auc# read the data indf = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")# rename the 'rank' column because there is also a DataFrame method called 'rank'df.columns = ["admit", "gre", "gpa", "prestige"]# dummify rankdummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')# create a clean data frame for the regressioncols_to_keep = ['admit', 'gre', 'gpa']data = df[cols_to_keep].join(dummy_ranks.iloc[:, 'prestige_2':])# manually add the interceptdata['intercept'] = 1.0train_cols = data.columns[1:]# fit the modelresult = sm.Logit(data['admit'], data[train_cols]).fit()print result.summary()# Add prediction to dataframedata['pred'] = result.predict(data[train_cols])fpr, tpr, thresholds =roc_curve(data['admit'], data['pred'])roc_auc = auc(fpr, tpr)print("Area under the ROC curve : %f" % roc_auc)##################################### The optimal cut off would be where tpr is high and fpr is low# tpr - (1-fpr) is zero or near to zero is the optimal cut off point####################################i = np.arange(len(tpr)) # index for dfroc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)})roc.iloc[(roc.tf-0).abs().argsort()[:1]]# Plot tpr vs 1-fprfig, ax = pl.subplots()pl.plot(roc['tpr'])pl.plot(roc['1-fpr'], color = 'red')pl.xlabel('1-False Positive Rate')pl.ylabel('True Positive Rate')pl.title('Receiver operating characteristic')ax.set_xticklabels([])

The optimal cut off point is 0.317628, so anything above this can be labeled as 1 else 0. You can see from the output/chart that where TPR is crossing 1-FPR the TPR is 63%, FPR is 36% and TPR-(1-FPR) is nearest to zero in the current example.

Output:

        1-fpr       fpr        tf     thresholds       tpr  171  0.637363  0.362637  0.000433    0.317628     0.637795

enter image description here

Hope this is helpful.

Edit

To simplify and bring in re-usability, I have made a function to find the optimal probability cutoff point.

Python Code:

def Find_Optimal_Cutoff(target, predicted):    """ Find the optimal probability cutoff point for a classification model related to event rate    Parameters    ----------    target : Matrix with dependent or target data, where rows are observations    predicted : Matrix with predicted data, where rows are observations    Returns    -------         list type, with optimal cutoff value            """    fpr, tpr, threshold = roc_curve(target, predicted)    i = np.arange(len(tpr))     roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]    return list(roc_t['threshold']) # Add prediction probability to dataframedata['pred_proba'] = result.predict(data[train_cols])# Find optimal probability thresholdthreshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])print threshold# [0.31762762459360921]# Find prediction to the dataframe applying thresholddata['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0)# Print confusion Matrixfrom sklearn.metrics import confusion_matrixconfusion_matrix(data['admit'], data['pred'])# array([[175,  98],#        [ 46,  81]])


Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:

optimal_idx = np.argmax(tpr - fpr)optimal_threshold = thresholds[optimal_idx]


Vanilla Python Implementation of Youden's J-Score

def cutoff_youdens_j(fpr,tpr,thresholds):    j_scores = tpr-fpr    j_ordered = sorted(zip(j_scores,thresholds))    return j_ordered[-1][1]