How does the class_weight parameter in scikit-learn work?


First off, it might not be good to go by recall alone: you can trivially achieve 100% recall by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say, a given precision level) that you are interested in.
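As a minimal sketch of that advice (my own illustration, not part of the original answer), one way to pick such an operating point is with precision_recall_curve; the 0.80 precision target here is just a placeholder:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

print(roc_auc_score(y, scores))            # threshold-free metric for model selection

# pick the first threshold whose precision reaches the (hypothetical) 0.80 target
prec, rec, thr = precision_recall_curve(y, scores)
t = thr[np.argmax(prec[:-1] >= 0.80)]      # assumes such a threshold exists
y_pred = (scores >= t).astype(int)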

For how class_weight works: it penalizes mistakes in samples of class[i] with class_weight[i] instead of 1, so a higher class weight means you want to put more emphasis on that class. From what you say it seems class 0 is 19 times more frequent than class 1, so you should increase the class_weight of class 1 relative to class 0, say {0:.1, 1:.9}. If the class_weight doesn't sum to 1, it will basically change the regularization parameter.
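To make that concrete (my sketch, not from the answer): passing a class_weight dict multiplies the per-sample loss by the weight of that sample's class, so it should match an explicit sample_weight with the same values, up to solver tolerance:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# class-1 errors count 5x, either via class_weight or via per-sample weights
clf_cw = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000).fit(X, y)
clf_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=np.where(y == 1, 5.0, 1.0))

print(clf_cw.coef_, clf_sw.coef_)          # should agree up to numerical tolerance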

For how class_weight="auto" works, you can have a look at this discussion. In the dev version you can use class_weight="balanced", which is easier to understand: it basically means replicating the smaller class until you have as many samples as in the larger one, but in an implicit way.
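For reference (my addition, based on current scikit-learn behaviour): the "balanced" weights are computed as n_samples / (n_classes * count of each class), which you can inspect with compute_class_weight:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(w)                              # ~[0.53, 10.0]: the rare class is up-weighted
print(len(y) / (2 * np.bincount(y)))  # same formula written out by hand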


The first answer is good for understanding how class_weight works, but I wanted to understand how I should be using it in practice.

SUMMARY

  • for moderately imbalanced data WITHOUT noise, applying class weights makes little difference
  • for moderately imbalanced data WITH noise, and for strongly imbalanced data, it is better to apply class weights
  • class_weight="balanced" works decently if you don't want to optimize the weights manually
  • with class_weight="balanced" you capture more true events (higher TRUE recall), but you are also more likely to get false alerts (lower TRUE precision)
    • as a result, the total % predicted TRUE might be higher than the actual % TRUE because of all the false positives
    • AUC might mislead you here if the false alarms are an issue
  • no need to move the decision threshold to the imbalance %, even for strong imbalance; it is fine to keep 0.5 (or somewhere around that, depending on what you need), see the short sketch after this list
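A short sketch of that last point (mine, not part of the original answer): with class_weight="balanced" the usual 0.5 cutoff already shifts the operating point, while cutting at the base rate instead flags far more samples:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=750, n_features=5, n_informative=2,
                           n_redundant=0, weights=[0.95], random_state=1)
proba = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

print((proba > 0.5).mean())       # share flagged at the usual 0.5 cutoff
print((proba > y.mean()).mean())  # cutoff at the imbalance % flags many more samples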

NB

The result might differ when using RF or GBM. sklearn does not have class_weight="balanced" for GBM, but lightgbm has LGBMClassifier(is_unbalance=True); a sample_weight workaround for sklearn's GBM is sketched below.
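As a workaround for the missing class_weight on sklearn's GradientBoostingClassifier (my sketch, not from the answer), you can pass per-sample weights computed with the same "balanced" formula:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=750, weights=[0.95], random_state=1)
sw = compute_sample_weight(class_weight="balanced", y=y)   # one weight per row, rare class up-weighted
gbm = GradientBoostingClassifier(random_state=1).fit(X, y, sample_weight=sw)
print(gbm.predict(X).mean())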

CODE

# scikit-learn==0.21.3
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np
import pandas as pd

# case: moderate imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.8]) #,flip_y=0.1,class_sep=0.5)
np.mean(y)  # 0.2

LogisticRegression(C=1e9).fit(X,y).predict(X).mean()  # 0.184
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean()  # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean()  # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X).mean()  # 0.296 => seems to make things worse?
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean()  # 0.292 => seems to make things worse?

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X))  # 0.83
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X))  # 0.86 => about the same
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X))  # 0.86 => about the same

# case: strong imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.95])
np.mean(y)  # 0.06

LogisticRegression(C=1e9).fit(X,y).predict(X).mean()  # 0.02
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean()  # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean()  # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X).mean()  # 0.25 => huh??
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean()  # 0.22 => huh??
(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).mean()  # same as last

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X))  # 0.64
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X))  # 0.84 => much better
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X))  # 0.85 => similar to manual
roc_auc_score(y,(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).astype(int))  # same as last

print(classification_report(y,LogisticRegression(C=1e9).fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True,normalize='index')
# few predicted TRUE with only 28% TRUE recall and 86% TRUE precision so 6%*28% ~= 2%

print(classification_report(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True,normalize='index')
# 88% TRUE recall but also a lot of false positives with only 23% TRUE precision, making total predicted % TRUE > actual % TRUE