How does the class_weight parameter in scikit-learn work?


First off, it might not be good to go by recall alone: you can trivially achieve 100% recall by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say, a given precision level) that you are interested in.
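As a minimal sketch of that advice (my own illustration, not part of the original answer), one way to pick such an operating point is with precision_recall_curve; the 0.80 precision target here is just a placeholder:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

print(roc_auc_score(y, scores))            # threshold-free metric for model selection

# pick the first threshold whose precision reaches the (hypothetical) 0.80 target
prec, rec, thr = precision_recall_curve(y, scores)
t = thr[np.argmax(prec[:-1] >= 0.80)]      # assumes such a threshold exists
y_pred = (scores >= t).astype(int)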

For how class_weight works: it penalizes mistakes in samples of class[i] with class_weight[i] instead of 1, so a higher class weight means you want to put more emphasis on that class. From what you say it seems class 0 is 19 times more frequent than class 1, so you should increase the class_weight of class 1 relative to class 0, say {0:.1, 1:.9}. If the class_weight doesn't sum to 1, it will basically change the regularization parameter.
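To make that concrete (my sketch, not from the answer): passing a class_weight dict multiplies the per-sample loss by the weight of that sample's class, so it should match an explicit sample_weight with the same values, up to solver tolerance:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# class-1 errors count 5x, either via class_weight or via per-sample weights
clf_cw = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000).fit(X, y)
clf_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=np.where(y == 1, 5.0, 1.0))

print(clf_cw.coef_, clf_sw.coef_)          # should agree up to numerical tolerance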

For how class_weight="auto" works, you can have a look at this discussion. In the dev version you can use class_weight="balanced", which is easier to understand: it basically means replicating the smaller class until you have as many samples as in the larger one, but in an implicit way.
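For reference (my addition, based on current scikit-learn behaviour): the "balanced" weights are computed as n_samples / (n_classes * count of each class), which you can inspect with compute_class_weight:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(w)                              # ~[0.53, 10.0]: the rare class is up-weighted
print(len(y) / (2 * np.bincount(y)))  # same formula written out by hand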


The first answer is good for understanding how class_weight works, but I wanted to understand how I should be using it in practice.

SUMMARY

  • for moderately imbalanced data WITHOUT noise, applying class weights makes little difference
  • for moderately imbalanced data WITH noise, and for strongly imbalanced data, it is better to apply class weights
  • class_weight="balanced" works decently if you don't want to optimize the weights manually
  • with class_weight="balanced" you capture more true events (higher TRUE recall), but you are also more likely to get false alerts (lower TRUE precision)
    • as a result, the total % predicted TRUE might be higher than the actual % TRUE because of all the false positives
    • AUC might mislead you here if the false alarms are an issue
  • no need to move the decision threshold to the imbalance %, even for strong imbalance; it is fine to keep 0.5 (or somewhere around that, depending on what you need), see the short sketch after this list
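A short sketch of that last point (mine, not part of the original answer): with class_weight="balanced" the usual 0.5 cutoff already shifts the operating point, while cutting at the base rate instead flags far more samples:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=750, n_features=5, n_informative=2,
                           n_redundant=0, weights=[0.95], random_state=1)
proba = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

print((proba > 0.5).mean())       # share flagged at the usual 0.5 cutoff
print((proba > y.mean()).mean())  # cutoff at the imbalance % flags many more samples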

NB

The result might differ when using RF or GBM. sklearn does not have class_weight="balanced" for GBM, but lightgbm has LGBMClassifier(is_unbalance=True); a sample_weight workaround for sklearn's GBM is sketched below.
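As a workaround for the missing class_weight on sklearn's GradientBoostingClassifier (my sketch, not from the answer), you can pass per-sample weights computed with the same "balanced" formula:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=750, weights=[0.95], random_state=1)
sw = compute_sample_weight(class_weight="balanced", y=y)   # one weight per row, rare class up-weighted
gbm = GradientBoostingClassifier(random_state=1).fit(X, y, sample_weight=sw)
print(gbm.predict(X).mean())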

CODE

# scikit-learn==0.21.3
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np
import pandas as pd

# case: moderate imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.8]) #,flip_y=0.1,class_sep=0.5)
np.mean(y)  # 0.2

LogisticRegression(C=1e9).fit(X,y).predict(X).mean()  # 0.184
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean()  # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean()  # 0.184 => same as first
LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X).mean()  # 0.296 => seems to make things worse?
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean()  # 0.292 => seems to make things worse?

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X))  # 0.83
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:2,1:8}).fit(X,y).predict(X))  # 0.86 => about the same
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X))  # 0.86 => about the same

# case: strong imbalance
X, y = datasets.make_classification(n_samples=50*15, n_features=5, n_informative=2, n_redundant=0, random_state=1, weights=[0.95])
np.mean(y)  # 0.06

LogisticRegression(C=1e9).fit(X,y).predict(X).mean()  # 0.02
(LogisticRegression(C=1e9).fit(X,y).predict_proba(X)[:,1]>0.5).mean()  # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:0.5,1:0.5}).fit(X,y).predict(X).mean()  # 0.02 => same as first
LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X).mean()  # 0.25 => huh??
LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X).mean()  # 0.22 => huh??
(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).mean()  # same as last

roc_auc_score(y,LogisticRegression(C=1e9).fit(X,y).predict(X))  # 0.64
roc_auc_score(y,LogisticRegression(C=1e9,class_weight={0:1,1:20}).fit(X,y).predict(X))  # 0.84 => much better
roc_auc_score(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X))  # 0.85 => similar to manual
roc_auc_score(y,(LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict_proba(X)[:,1]>0.5).astype(int))  # same as last

print(classification_report(y,LogisticRegression(C=1e9).fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9).fit(X,y).predict(X),margins=True,normalize='index')
# few predicted TRUE with only 28% TRUE recall and 86% TRUE precision so 6%*28% ~= 2%

print(classification_report(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X)))
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True)
pd.crosstab(y,LogisticRegression(C=1e9,class_weight="balanced").fit(X,y).predict(X),margins=True,normalize='index')
# 88% TRUE recall but also a lot of false positives with only 23% TRUE precision, making total predicted % TRUE > actual % TRUE