spaCy entity ruler – how to order patterns

Question:

I would like to label every token that has not been matched by an earlier pattern as "Unknown".
Unfortunately, the entity ruler does not seem to respect the order in which the patterns were provided:

import spacy
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {'label': 'Country', 'pattern': [{'lower': 'ger'}]},
    {'label': 'Unknown', 'pattern': [{'OP': '?'}]}
]
ruler.add_patterns(patterns)
doc = nlp('ger is a country')
print([(ent.text, ent.label_) for ent in doc.ents])

Expected:

[('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]

Actual:

[('ger', 'Unknown'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]

How can I ensure the patterns are matched in order?

Asked By: Andreas


Answers:

There are a couple of ways to do this. A simple one is to use two EntityRulers. By default, the second won’t overwrite anything set by the first.

You could also use the relatively new SpanRuler with a custom filtering function that always prefers other entities over "Unknown" ones.
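
To make the filtering idea concrete, here is a minimal, library-independent sketch of the logic such a function could apply. It is not spaCy code: spans are represented as plain (start, end, label) tuples, and the name prefer_known_spans and the "Unknown" fallback label are assumptions made for illustration. As far as I understand, a SpanRuler can be given a comparable function through its ents_filter setting, but check the spaCy documentation for the exact signature before relying on that.

```python
from typing import List, Tuple

# (start_token, end_token, label) with an exclusive end.
# A hypothetical stand-in for spaCy's Span objects.
SpanTuple = Tuple[int, int, str]


def prefer_known_spans(
    spans: List[SpanTuple], fallback_label: str = "Unknown"
) -> List[SpanTuple]:
    """Keep non-overlapping spans, letting any specific label beat the fallback."""
    # Specific labels sort first, then longer spans, then earlier spans.
    ordered = sorted(
        spans,
        key=lambda s: (s[2] == fallback_label, -(s[1] - s[0]), s[0]),
    )
    taken = set()  # token indices already covered by a kept span
    kept = []
    for start, end, label in ordered:
        tokens = range(start, end)
        if any(i in taken for i in tokens):
            continue  # overlaps a span with higher priority
        taken.update(tokens)
        kept.append((start, end, label))
    return sorted(kept)


# Candidate matches for "ger is a country": the catch-all pattern also matches "ger".
matches = [
    (0, 1, "Unknown"),
    (0, 1, "Country"),
    (1, 2, "Unknown"),
    (2, 3, "Unknown"),
    (3, 4, "Unknown"),
]
result = prefer_known_spans(matches)
print(result)
# [(0, 1, 'Country'), (1, 2, 'Unknown'), (2, 3, 'Unknown'), (3, 4, 'Unknown')]
```

Sorting by "is this the fallback label?" first means the order in which patterns were added no longer matters: a specific label always claims its tokens before the catch-all gets a chance.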

Answered By: polm23

Based on polm23's answer, here is a working example:

import spacy

nlp = spacy.blank("en")

# Standard entity ruler: holds the specific patterns and runs first
ruler_standard = nlp.add_pipe("entity_ruler", name='ruler_standard', config={'overwrite_ents': True})
patterns = [{'label': 'Country', 'pattern': [{'lower': 'ger'}]}]
ruler_standard.add_patterns(patterns)

# Unknown entity ruler: runs second; with overwrite_ents=False it skips tokens that already belong to an entity
ruler_unknown = nlp.add_pipe("entity_ruler", name='ruler_unknown', after='ruler_standard', config={'overwrite_ents': False})
patterns = [{'label': 'Unknown', 'pattern': [{"OP": "?"}]}]
ruler_unknown.add_patterns(patterns)


doc = nlp('ger is a country')
print([(ent.text, ent.label_) for ent in doc.ents])
# [('ger', 'Country'), ('is', 'Unknown'), ('a', 'Unknown'), ('country', 'Unknown')]

Answered By: Andreas