Roc curve and cut off point. Python

You can do this using the epi package in R, however I could not find similar package or example in Python.

The optimal cut off point would be where “true positive rate” is high and the “false positive rate” is low. Based on this logic, I have pulled an example below to find optimal threshold.

Python code:

import pandas as pdimport statsmodels.api as smimport pylab as plimport numpy as npfrom sklearn.metrics import roc_curve, auc# read the data indf = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")# rename the 'rank' column because there is also a DataFrame method called 'rank'df.columns = ["admit", "gre", "gpa", "prestige"]# dummify rankdummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')# create a clean data frame for the regressioncols_to_keep = ['admit', 'gre', 'gpa']data = df[cols_to_keep].join(dummy_ranks.iloc[:, 'prestige_2':])# manually add the interceptdata['intercept'] = 1.0train_cols = data.columns[1:]# fit the modelresult = sm.Logit(data['admit'], data[train_cols]).fit()print result.summary()# Add prediction to dataframedata['pred'] = result.predict(data[train_cols])fpr, tpr, thresholds =roc_curve(data['admit'], data['pred'])roc_auc = auc(fpr, tpr)print("Area under the ROC curve : %f" % roc_auc)##################################### The optimal cut off would be where tpr is high and fpr is low# tpr - (1-fpr) is zero or near to zero is the optimal cut off point####################################i = np.arange(len(tpr)) # index for dfroc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)})roc.iloc[(roc.tf-0).abs().argsort()[:1]]# Plot tpr vs 1-fprfig, ax = pl.subplots()pl.plot(roc['tpr'])pl.plot(roc['1-fpr'], color = 'red')pl.xlabel('1-False Positive Rate')pl.ylabel('True Positive Rate')pl.title('Receiver operating characteristic')ax.set_xticklabels([])

The optimal cut off point is 0.317628, so anything above this can be labeled as 1 else 0. You can see from the output/chart that where TPR is crossing 1-FPR the TPR is 63%, FPR is 36% and TPR-(1-FPR) is nearest to zero in the current example.

Output:

        1-fpr       fpr        tf     thresholds       tpr  171  0.637363  0.362637  0.000433    0.317628     0.637795

Hope this is helpful.

Edit

To simplify and bring in re-usability, I have made a function to find the optimal probability cutoff point.

Python Code:

def Find_Optimal_Cutoff(target, predicted):    """ Find the optimal probability cutoff point for a classification model related to event rate    Parameters    ----------    target : Matrix with dependent or target data, where rows are observations    predicted : Matrix with predicted data, where rows are observations    Returns    -------         list type, with optimal cutoff value            """    fpr, tpr, threshold = roc_curve(target, predicted)    i = np.arange(len(tpr))     roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})    roc_t = roc.iloc[(roc.tf-0).abs().argsort()[:1]]    return list(roc_t['threshold']) # Add prediction probability to dataframedata['pred_proba'] = result.predict(data[train_cols])# Find optimal probability thresholdthreshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])print threshold# [0.31762762459360921]# Find prediction to the dataframe applying thresholddata['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0)# Print confusion Matrixfrom sklearn.metrics import confusion_matrixconfusion_matrix(data['admit'], data['pred'])# array([[175,  98],#        [ 46,  81]])

python logistic-regression roc

Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:

optimal_idx = np.argmax(tpr - fpr)optimal_threshold = thresholds[optimal_idx]

python logistic-regression roc

Vanilla Python Implementation of Youden's J-Score

def cutoff_youdens_j(fpr,tpr,thresholds):    j_scores = tpr-fpr    j_ordered = sorted(zip(j_scores,thresholds))    return j_ordered[-1][1]

CodeHunter

Roc curve and cut off point. Python

Python code:

Output:

Edit

Python Code:

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last