Evaluation in a Spacy NER model


I am trying to evaluate a trained NER model created with the spaCy library.
Normally for this kind of problem you can use the F1 score (the harmonic mean of precision and recall). I could not find an accuracy function for a trained NER model in the documentation.

I am not sure if it’s correct, but I am trying to do it the following way (example), using f1_score from sklearn:

from sklearn.metrics import f1_score
import spacy
from spacy.gold import GoldParse

nlp = spacy.load("en") #load NER model
test_text = "my name is John" # text to test accuracy
doc_to_test = nlp(test_text) # transform the text to spacy doc format

# we create a golden doc where we know the tagged entity for the text to be tested
doc_gold_text= nlp.make_doc(test_text)
entity_offsets_of_gold_text = [(11, 15,"PERSON")]
gold = GoldParse(doc_gold_text, entities=entity_offsets_of_gold_text)

# bring the data in a format acceptable for sklearn f1 function
y_true = ["PERSON" if "PERSON" in x else 'O' for x in gold.ner]
y_predicted = [x.ent_type_ if x.ent_type_ !='' else 'O' for x in doc_to_test]
f1_score(y_true, y_predicted, average='macro')
# 1.0
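
To see what the token-level macro F1 above actually measures, here is a minimal sketch in plain Python. The helper name macro_f1 is hypothetical (it is not part of sklearn or spaCy); it averages the per-label F1 over every label seen in either list, which is my understanding of what average='macro' does by default.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over token-level labels (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Token labels for "my name is John"
y_true = ["O", "O", "O", "PERSON"]
print(macro_f1(y_true, ["O", "O", "O", "PERSON"]))  # 1.0
# A model that predicts no entity at all still scores roughly 0.43,
# because the 'O' label is counted like any other class.
print(macro_f1(y_true, ["O", "O", "O", "O"]))
```

Because 'O' tokens count as a class, a token-level score can look better than the entity-level result; scoring whole entities avoids that.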

Any thoughts or insights are useful.

Asked By: Mpizos Dimitris



You can find different metrics including F-score, recall and precision in spaCy/scorer.py.

This example shows how you can use it:

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

# example run

examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')]),
]

# ner_model_path is the path to your trained model; for spaCy's pretrained model use 'en_core_web_sm'
ner_model = spacy.load(ner_model_path)
results = evaluate(ner_model, examples)

scorer.scores returns multiple scores. Running the example gives the result below. (Note that the low scores occur because the examples label London and Berlin as ‘LOC’, while the model labels them as ‘GPE’. You can see this by looking at ents_per_type.)

{'uas': 0.0, 'las': 0.0, 'las_per_type': {'attr': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'root': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'compound': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'dobj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'cc': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'conj': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'ents_p': 33.33333333333333, 'ents_r': 33.33333333333333, 'ents_f': 33.33333333333333, 'ents_per_type': {'PERSON': {'p': 100.0, 'r': 100.0, 'f': 100.0}, 'LOC': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'GPE': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'tags_acc': 0.0, 'token_acc': 100.0, 'textcat_score': 0.0, 'textcats_per_cat': {}}
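
As a rough illustration of where the 33.33 comes from, here is a minimal sketch of exact-match entity scoring in plain Python. It is not spaCy's implementation; entity_prf is a hypothetical helper, and the predicted spans are an assumption that mirrors the run described above (GPE instead of LOC).

```python
def entity_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall and F1 over entities (hypothetical helper).

    Each argument is a list with one item per text; each item is a list of
    (start, end, label) tuples. An entity is correct only if all three match.
    """
    # Tag every span with its document index so spans from different texts never collide.
    gold = {(i, *span) for i, spans in enumerate(gold_docs) for span in spans}
    pred = {(i, *span) for i, spans in enumerate(pred_docs) for span in spans}
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

gold_docs = [[(7, 17, "PERSON")], [(7, 13, "LOC"), (18, 24, "LOC")]]
# Assumed model output: correct offsets, but GPE where the annotations say LOC.
pred_docs = [[(7, 17, "PERSON")], [(7, 13, "GPE"), (18, 24, "GPE")]]

p, r, f = entity_prf(gold_docs, pred_docs)
print(f"p={p:.4f} r={r:.4f} f={f:.4f}")  # p=0.3333 r=0.3333 f=0.3333
```

Only one of the three annotated entities matches in both offsets and label, so precision, recall and F1 all come out at one third.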

The example is taken from a spaCy example on GitHub (the link no longer works). It was last tested with spaCy 2.2.4.

Answered By: Mpizos Dimitris

Since I faced the same problem, I am posting the code for the example shown in the accepted answer, adapted for spaCy v3:

import spacy
from spacy.scorer import Scorer
from spacy.training.example import Example

examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')]),
]

def evaluate(ner_model, examples):
    scorer = Scorer()
    example = []
    for input_, annot in examples:
        pred = ner_model(input_)  # predicted doc
        # the reference annotations go in a dict under the 'entities' key
        temp = Example.from_dict(pred, {'entities': annot})
        example.append(temp)
    scores = scorer.score(example)
    return scores

ner_model = spacy.load('en_core_web_sm')  # or the path to your own trained model
results = evaluate(ner_model, examples)

Breaking changes occurred because classes such as GoldParse were removed in spaCy v3.

I believe the part of the answer about metrics is still valid.

Answered By: miguelik

Note that spaCy v3 has an evaluate command (python -m spacy evaluate) that you can run from the command line instead of writing custom code to handle things.

Answered By: polm23

This is how I calculated accuracy for my custom spaCy NER model:

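def flat_accuracy(text, annotations):
    # annotations are assumed to be (start, end, label) tuples, as in the examples above
    actual_ents = [ents[2] for ents in annotations]
    prediction = nlp_ner(text)  # nlp_ner is the loaded custom NER pipeline
    # compare the predicted labels, in order, with the annotated labels
    pred_ents = [ent.label_ for ent in prediction.ents]
    return 1 if actual_ents == pred_ents else 0

predict_points = sum(flat_accuracy(text, annotations) for text, annotations in examples)
output = (predict_points / len(examples)) * 100
# output --> 82%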
Answered By: jeevu94

I searched for solutions on the internet but could not find one that worked. Now that I have figured out the root of the problem, I am sharing my code, which is similar to the one in the original question. I hope someone still finds it useful. It works with spaCy v3.3.

from spacy.scorer import Scorer
from spacy.training import Example

def evaluate(ner_model, samples):
    scorer = Scorer(ner_model)
    example = []
    for sample in samples:
        pred = ner_model(sample['text'])  # predicted doc
        print(pred, sample['entities'])
        temp_ex = Example.from_dict(pred, {'entities': sample['entities']})
        example.append(temp_ex)  # collect the examples so the scorer can see them
    scores = scorer.score(example)
    return scores

Note: samples should be a list of dictionaries in the spaCy v3 format, each with a 'text' and an 'entities' key, like the one below:

{'text': '#Causes - Quinsy - CA0K.1nPeri Tonsillar Abscess is usually a complication of an untreated or partially treated acute tonsillitis. The infection, in these cases, spreads to the peritonsillar area (peritonsillitis). This region comprises loose connective tissue and is hence susceptible to formation of abscess.', 'entities': [(10, 16, 'Disease_E'), (26, 48, 'Disease_E'), (112, 129, 'Complication_E'), (177, 213, 'Anatomy_E'), (237, 260, 'Anatomy_E'), (302, 309, 'Disease_E')]}
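
Before scoring, it can help to check that the character offsets in each sample really point at the intended text. The following is a minimal sketch in plain Python; entity_texts is a hypothetical helper, not a spaCy function, and it only checks bounds and overlaps (spaCy additionally expects the offsets to line up with token boundaries, which this sketch does not verify).

```python
def entity_texts(sample):
    """Return (surface_text, label) for each annotated span, checking the offsets."""
    text = sample["text"]
    spans = []
    previous_end = 0
    for start, end, label in sorted(sample["entities"]):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        if start < previous_end:
            raise ValueError(f"span ({start}, {end}) overlaps the previous one")
        spans.append((text[start:end], label))
        previous_end = end
    return spans

sample = {"text": "I like London and Berlin.",
          "entities": [(7, 13, "LOC"), (18, 24, "LOC")]}
print(entity_texts(sample))  # [('London', 'LOC'), ('Berlin', 'LOC')]
```

If the printed substrings are not the entities you meant to annotate, the offsets are off and the scores will be misleading.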
Answered By: A'r SHAON