Problem extracting NER subject + verb with spaCy and Matcher

Question:

I’m working on an NLP project and I have to use spaCy and the spaCy Matcher to extract all named entities that are nsubj (subjects), together with the verb each one relates to, i.e. the governor verb of the named-entity subject.
Example:

Georges and his friends live in Mexico City
"Hello !", says Mary

I need to extract "Georges" and "live" from the first sentence and "Mary" and "says" from the second, but I don’t know how many words will come between the named entity and the verb it relates to. So I decided to explore the spaCy Matcher further.
I’m struggling to write a Matcher pattern that extracts these two words. When the named-entity subject comes before the verb I get good results, but I don’t know how to write a pattern that matches a named-entity subject that comes after the verb it relates to. According to the guidelines I could also do this task with "regular spaCy", but I don’t know how. My problem with Matcher is that I can’t constrain the dependency relation between the named entity and the VERB, so I can’t be sure of grabbing the right VERB. I’m new to spaCy; I’ve always worked with NLTK or Jieba (for Chinese). I don’t even know how to split a text into sentences with spaCy, so I split the whole text into sentences with NLTK to avoid bad matches spanning two sentences.
Here is my code:

import spacy
from nltk import sent_tokenize
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_md')

matcher = Matcher(nlp.vocab)

def get_entities_verbs():

    try:

        # subject before verb
        pattern_subj_verb = [{'ENT_TYPE': 'PER', 'DEP': 'nsubj'}, {"POS": {'NOT_IN':['VERB']}, "DEP": {'NOT_IN':['nsubj']}, 'OP':'*'}, {'POS':'VERB'}]
        # subject after verb
        # this pattern is not good

        matcher.add('ent-verb', [pattern_subj_verb])

        for sent in sent_tokenize(open('Le_Ventre_de_Paris-short.txt').read()):
            sent = nlp(sent)
            matches = matcher(sent)
            for match_id, start, end in matches:
                span = sent[start:end]
                print(span)

    except Exception as error:
        print(error)


def main():

    get_entities_verbs()

if __name__ == '__main__':
    main()

Even though the text is in French, I can assure you that I get good results:

Florent regardait
Lacaille reparut
Florent baissait
Claude regardait
Florent resta
Florent, soulagé
Claude s’était arrêté
Claude en riait
Saget est matinale, dit
Florent allait
Murillo peignait
Florent accablé
Claude entra
Claude l’appelait
Florent regardait
Florent but son verre de punch ; il le sentit
Alexandre, dit
Florent levait
Claude était ravi
Claude et Florent revinrent
Claude, les mains dans les poches, sifflant

Some results are wrong, but about 90% are good. I just need to grab the first and last word of each line to get my NE/verb pair.
So my question is: how can I extract a named entity that is a subject, together with the verb it relates to, using Matcher, or simply with spaCy (without Matcher)? There are too many factors to take into account. Do you have a method that gives the best possible results, even if 100% is not achievable?
I need a pattern that matches a governor VERB followed by its NER subject, based on this pattern:

pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

All credit to polm23 for this pattern

Asked By: Etienne Armangau

||

Answers:

This is a perfect use case for the Dependency Matcher. It also makes things easier if you merge entities to single tokens before running it. This code should do what you need:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

# merge entities to simplify this
nlp.add_pipe("merge_entities")


pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

texts = [
        "John Smith and some other guy live there",
        '"Hello!", says Mary.',
        ]

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)

    for match_id, token_ids in matches:
        # token_ids are token indices (not a span), listed in pattern order,
        # so the nsubj comes first and its governing verb second
        person, verb = token_ids
        print(doc[person], "::", doc[verb])
    print()
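The question uses a French pipeline, whose person label is PER rather than PERSON, so the pattern above may need its entity label changed. Here is a minimal sketch of one way to parameterise it and to collect the results as pairs. The helper names `build_subject_verb_pattern` and `pairs_from_matches` are made up for this illustration and are not part of spaCy; the sketch relies on `RIGHT_ATTRS` accepting the same `{"IN": [...]}` syntax as regular Matcher token patterns.

```python
from typing import Dict, Iterable, List, Tuple


def build_subject_verb_pattern(labels: Iterable[str]) -> List[Dict]:
    """Dependency pattern: an entity with one of `labels` that is the nsubj of a verb."""
    return [
        {   # anchor node: the named entity acting as subject
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": {"IN": list(labels)}, "DEP": "nsubj"},
        },
        {   # "<" means: "person" is an immediate dependent of "verb"
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        },
    ]


def pairs_from_matches(doc, matches) -> List[Tuple[str, str]]:
    """Turn DependencyMatcher output into (subject text, verb text) pairs."""
    pairs = []
    for _match_id, token_ids in matches:
        # token_ids follow the pattern order: entity first, verb second
        person_i, verb_i = token_ids
        pairs.append((doc[person_i].text, doc[verb_i].text))
    return pairs
```

With the `fr_core_news_md` pipeline from the question this would be used as `matcher.add("PERVERB", [build_subject_verb_pattern(["PER"])])`, followed by `pairs_from_matches(doc, matcher(doc))`. Because the pattern describes the dependency relation and not the word order, it should not matter whether the subject comes before or after the verb.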

Check out the docs for the DependencyMatcher.
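Since the question also asks how to do this with "regular spaCy", here is a rough sketch that skips the matcher and reads the relation straight off the parse: for each entity, take its root token, check that it is an nsubj, and look at its head. `subject_verb_pairs` and `demo` are hypothetical names for this illustration, and the default labels (PERSON for the English models, PER for the French ones) are an assumption you may need to adjust. No sentence splitting should be needed here, because a token's head comes from the dependency parse of its own sentence.

```python
from typing import Iterable, List, Tuple


def subject_verb_pairs(
    doc,
    labels: Iterable[str] = ("PERSON", "PER"),
    subject_deps: Iterable[str] = ("nsubj",),
    verb_pos: Iterable[str] = ("VERB",),
) -> List[Tuple[str, str]]:
    """Return (entity text, governing verb text) pairs from a parsed spaCy Doc."""
    labels, subject_deps, verb_pos = set(labels), set(subject_deps), set(verb_pos)
    pairs = []
    for ent in doc.ents:
        if ent.label_ not in labels:
            continue
        root = ent.root  # the token of the entity that attaches to the rest of the parse
        # keep the entity only if it is a subject whose head is a verb
        if root.dep_ in subject_deps and root.head.pos_ in verb_pos:
            pairs.append((ent.text, root.head.text))
    return pairs


def demo() -> None:
    """Illustrative usage; needs spaCy and the en_core_web_sm model installed."""
    import spacy

    nlp = spacy.load("en_core_web_sm")
    texts = ["John Smith and some other guy live there", '"Hello!", says Mary.']
    for text in texts:
        print(subject_verb_pairs(nlp(text)))
```

Calling `demo()` runs it on the two example sentences. Because this works on entity spans through `ent.root`, the `merge_entities` step is not required. The results still depend on the parser and the NER being right, so some misses are to be expected.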

Answered By: polm23