
Evaluation in a spaCy NER model


You can find the different metrics, including F-score, recall and precision, in spaCy's Scorer class (spacy/scorer.py).

This example shows how you can use it:

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # tokenize the raw text and attach the gold entity annotations
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        # run the model and score its predictions against the gold parse
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

# example run
examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained model use 'en_core_web_sm'
results = evaluate(ner_model, examples)

scorer.scores returns a dictionary with multiple scores. Running the example produces the result below. (Note that the scores are low because the examples label London and Berlin as 'LOC' while the model predicts them as 'GPE'; you can see this by looking at ents_per_type.)

{'uas': 0.0, 'las': 0.0, 'las_per_type': {'attr': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'root': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'compound': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'dobj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'cc': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'conj': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'ents_p': 33.33333333333333, 'ents_r': 33.33333333333333, 'ents_f': 33.33333333333333, 'ents_per_type': {'PERSON': {'p': 100.0, 'r': 100.0, 'f': 100.0}, 'LOC': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'GPE': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'tags_acc': 0.0, 'token_acc': 100.0, 'textcat_score': 0.0, 'textcats_per_cat': {}}
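
If you only care about the NER numbers, you can pull the entity-related keys out of the returned dictionary, for example:

# assuming `results` is the dictionary returned by evaluate() above
print("precision:", results['ents_p'])
print("recall:   ", results['ents_r'])
print("f-score:  ", results['ents_f'])
# per-label breakdown, useful to spot label mismatches such as LOC vs. GPE
for label, scores in results['ents_per_type'].items():
    print(label, scores['p'], scores['r'], scores['f'])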

The example is taken from a spaCy example on GitHub (the link does not work anymore). It was last tested with spaCy 2.2.4.


Note that in spaCy v3 there is an evaluate command you can run from the command line instead of writing custom evaluation code.
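
As a rough sketch of that workflow (the file names below are placeholders, and the gold data has to be converted to spaCy's binary .spacy format first, e.g. with a DocBin):

# minimal sketch: write gold annotations to a .spacy file, then run the CLI
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank('en')
gold_data = [
    ('Who is Shaka Khan?', [(7, 17, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')]),
]
db = DocBin()
for text, entities in gold_data:
    doc = nlp.make_doc(text)
    # char_span returns None if the offsets don't align with token boundaries
    doc.ents = [doc.char_span(start, end, label=label) for start, end, label in entities]
    db.add(doc)
db.to_disk('./dev.spacy')

# then, from the command line:
#   python -m spacy evaluate en_core_web_sm ./dev.spacy --output metrics.json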


Since I faced the same problem, I am going to post here the code for the example shown in the accepted answer, but for spaCy v3:

import spacy
from spacy.scorer import Scorer
from spacy.training.example import Example

examples = [
    ('Who is Shaka Khan?',
     {'entities': [(7, 17, 'PERSON')]}),
    ('I like London and Berlin.',
     {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]})
]

def evaluate(ner_model, examples):
    scorer = Scorer()
    example = []
    for input_, annot in examples:
        # run the model to get the predicted doc
        pred = ner_model(input_)
        # pair the predictions with the gold annotations
        temp = Example.from_dict(pred, annot)
        example.append(temp)
    # score all Example objects at once
    scores = scorer.score(example)
    return scores

ner_model = spacy.load('en_core_web_sm')  # or the path to your own trained pipeline
results = evaluate(ner_model, examples)
print(results)

Breaking changes occurred because classes such as GoldParse were deprecated and removed in spaCy v3.

I believe the part of the accepted answer about the metrics themselves is still valid.