How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn? How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn? python python

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?


I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).

Class weights

The weights from the class_weight parameter are used to train the classifier.They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.

Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.

The metrics

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...

Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.

I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negative that require defining which class is the positive one.

             precision    recall  f1-score   support          0       0.65      1.00      0.79        17          1       0.57      0.75      0.65        16          2       0.33      0.06      0.10        17avg / total       0.52      0.60      0.51        50

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data  or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".

You get this warning because you are using the f1-score, recall and precision without defining how they should be computed!The question could be rephrased: from the above classification report, how do you output one global number for the f1-score?You could:

  1. Take the average of the f1-score for each class: that's the avg / total result above. It's also called macro averaging.
  2. Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
  3. Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.

These are 3 of the options in scikit-learn, the warning is there to say you have to pick one. So you have to specify an average argument for the score method.

Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you'll get more importance for the class 5.

The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.

Computing scores

Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.

Here's a way to do it using StratifiedShuffleSplit, which gives you a random splits of your data (after shuffling) that preserve the label distribution.

from sklearn.datasets import make_classificationfrom sklearn.cross_validation import StratifiedShuffleSplitfrom sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix# We use a utility to generate artificial classification data.X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)for train_idx, test_idx in sss:    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]    svc.fit(X_train, y_train)    y_pred = svc.predict(X_test)    print(f1_score(y_test, y_pred, average="macro"))    print(precision_score(y_test, y_pred, average="macro"))    print(recall_score(y_test, y_pred, average="macro"))    

Hope this helps.


Lot of very detailed answers here but I don't think you are answering the right questions. As I understand the question, there are two concerns:

  1. How to I score a multiclass problem?
  2. How do I deal with unbalanced data?

1.

You can use most of the scoring functions in scikit-learn with both multiclass problem as with single class problems. Ex.:

from sklearn.metrics import precision_recall_fscore_support as scorepredicted = [1,2,3,4,5,1,2,1,1,4,5] y_test = [1,2,3,4,5,1,2,1,1,4,1]precision, recall, fscore, support = score(y_test, predicted)print('precision: {}'.format(precision))print('recall: {}'.format(recall))print('fscore: {}'.format(fscore))print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

| Label | Precision | Recall | FScore | Support ||-------|-----------|--------|--------|---------|| 1     | 94%       | 83%    | 0.88   | 204     || 2     | 71%       | 50%    | 0.54   | 127     || ...   | ...       | ...    | ...    | ...     || 4     | 80%       | 98%    | 0.89   | 838     || 5     | 93%       | 81%    | 0.91   | 1190    |

Then...

2.

... you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread.However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.


Posed question

Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure. Macro Precision and Macro Recall can be also used, but they are not so easily interpretable as for binary classificaion, they are already incorporated into F-measure, and excess metrics complicate methods comparison, parameters tuning, and so on.

Micro averaging are sensitive to class imbalance: if your method, for example, works good for the most common labels and totally messes others, micro-averaged metrics show good results.

Weighting averaging isn't well suited for imbalanced data, because it weights by counts of labels. Moreover, it is too hardly interpretable and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey I strongly recommend to look through:

Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.

Application-specific question

However, returning to your task, I'd research 2 topics:

  1. metrics commonly used for your specific task - it lets (a) tocompare your method with others and understand if you do somethingwrong, and (b) to not explore this by yourself and reuse someoneelse's findings;
  2. cost of different errors of your methods - forexample, use-case of your application may rely on 4- and 5-starreviewes only - in this case, good metric should count only these 2labels.

Commonly used metrics.As I can infer after looking through literature, there are 2 main evaluation metrics:

  1. Accuracy, which is used, e.g. in

Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using Yelp Business."

(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

(link)

  1. MSE (or, less often, Mean Absolute Error - MAE) - see, for example,

Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with restaurant reviews." Final Projects from CS N 224 (2010).

(link) - they explore both accuracy and MSE, considering the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis." Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.

(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.

Cost of different errors.If you care more about avoiding gross blunders, e.g. assinging 1-star to 5-star review or something like that, look at MSE; if difference matters, but not so much, try MAE, since it doesn't square diff; otherwise stay with Accuracy.

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperforms Multiclass classifiers like SVC or OVA SVM.