How to set a threshold for a sklearn classifier based on ROC results?



This is what I have done:

from sklearn.metrics import confusion_matrix, roc_curve

model = SomeSklearnModel()
model.fit(X_train, y_train)
predict = model.predict(X_test)
# predict_proba returns one column per class; keep the positive-class column,
# otherwise roc_curve will reject the 2D array
predict_probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, predict_probabilities)

However, I am annoyed that predict uses a cutoff corresponding to a true positive rate of only 0.4% (with zero false positives). The ROC curve shows a threshold I like better for my problem, where the true positive rate is approximately 20% (and the false positive rate around 4%). I then scan the thresholds returned by roc_curve to find the probability cutoff that corresponds to my favourite ROC point; in my case this probability is 0.21 (a programmatic version of this scan is sketched after the confusion matrix below). Then I create my own predict array:

import numpy as np

predict_mine = np.where(predict_probabilities > 0.21, 1, 0)

and there you go:

confusion_matrix(y_test, predict_mine)

returns what I wanted:

array([[6927,  309],
       [ 621,  121]])
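
For reference, here is a minimal sketch of how the threshold scan described above can be done programmatically, using the thresholds array that roc_curve returns rather than reading the value off the curve by hand. The target rates (TPR ≈ 0.20, FPR ≈ 0.04) are the ones quoted above, and model, X_test and y_test are assumed to be defined as in the question:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

# Positive-class probabilities, as in the question
predict_probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, predict_probabilities)

# Pick the threshold whose (FPR, TPR) point lies closest to the desired one
target_fpr, target_tpr = 0.04, 0.20
distances = np.sqrt((fpr - target_fpr) ** 2 + (tpr - target_tpr) ** 2)
best_threshold = thresholds[np.argmin(distances)]

# Apply the chosen threshold instead of the default 0.5 cutoff
predict_mine = (predict_probabilities > best_threshold).astype(int)
print(best_threshold)
print(confusion_matrix(y_test, predict_mine))

If you can use scikit-learn 1.5 or later, FixedThresholdClassifier and TunedThresholdClassifierCV in sklearn.model_selection wrap this pattern directly.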


It's difficult to give an exact answer without specific code. If you're already doing cross validation, you might consider specifying the AUC as the metric to optimize:

from sklearn.model_selection import KFold, cross_val_score

# In modern scikit-learn (0.18+), KFold lives in model_selection and no longer
# takes the sample count; use n_splits instead of n_folds
shuffle = KFold(n_splits=10, shuffle=True)
scores = cross_val_score(classifier, X_train, y_train, cv=shuffle, scoring='roc_auc')
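
Since AUC measures ranking quality independently of any single threshold, the same scoring choice can also drive hyperparameter selection. Below is a minimal sketch using GridSearchCV with scoring='roc_auc'; the RandomForestClassifier and the parameter grid are placeholders I've assumed for illustration, not part of the original post:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical estimator and grid; substitute your own classifier and parameters
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
shuffle = KFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='roc_auc',  # optimize threshold-independent ranking quality
    cv=shuffle,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

You can then apply the thresholding step from the question to the best estimator found by the search.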