spaCy incorrectly identifying pronouns

Question:

When I try this code using spaCy, I get the desired result:

import spacy
nlp = spacy.load("en_core_web_sm")

# example 1
test = "All my stuff is at to MyBOQ"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)

The output shows All and my. However, if I add a question mark:

test = "All my stuff is at to MyBOQ?"
doc = nlp(test)
for word in doc:
    if word.pos_ == 'PRON':
        print(word.text)

Now it also identifies MyBOQ as a pronoun. It should be classified as an organization name (word.pos_ == 'ORG') instead.

How do I tell spaCy not to classify MyBOQ as a pronoun? Should I just remove all punctuation before checking for pronouns?

Asked By: jmich738

Answers:

When running your code on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4), spaCy produces the following results for the text with and without the question mark:

                               en_core_web_sm   en_core_web_md   en_core_web_trf
All my stuff is at to MyBOQ?   All, my          my               my
All my stuff is at to MyBOQ    All, my          my               my

In this example, the word "All" is a determiner rather than a pronoun, so only the en_core_web_md and en_core_web_trf pipelines produce technically correct results. If you’re running an old version of spaCy, I’d suggest updating the package. Alternatively, if spaCy is up to date, try restarting your IDE/computer to see if it stops producing erroneous results. There should be no need to remove punctuation before checking for pronouns.
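
If you want to reproduce this comparison on your own machine, something along these lines should work. It is only a rough sketch: the helper names pronouns and compare_pipelines are made up for illustration, and it assumes every model you pass in is already installed. The output will depend on your spaCy and model versions, so none is shown here.

```python
def pronouns(doc):
    """Return the text of every token tagged PRON in a spaCy Doc."""
    return [token.text for token in doc if token.pos_ == "PRON"]


def compare_pipelines(model_names, texts):
    """Print the spaCy version, then the pronouns each pipeline finds.

    Hypothetical helper: spaCy is imported lazily here, and every model
    named in model_names is assumed to be installed already.
    """
    import spacy

    print("spaCy", spacy.__version__)
    for name in model_names:
        nlp = spacy.load(name)
        # nlp.meta["version"] is the version of the model package itself.
        print(name, nlp.meta["version"])
        for text in texts:
            print(f"  {text!r}: {pronouns(nlp(text))}")
```

For example, compare_pipelines(["en_core_web_sm", "en_core_web_md"], ["All my stuff is at to MyBOQ", "All my stuff is at to MyBOQ?"]) runs both sentences through both pipelines, which makes it easy to see whether the question mark changes anything in your setup.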

Finally, part-of-speech (PoS) tags do not include organisation names (ORG); ORG is a named-entity label. I think you’re mixing up named-entity tags with PoS tags. "MyBOQ" should be PoS-tagged as a proper noun (PROPN), which the en_core_web_md and en_core_web_trf pipelines identify correctly, whereas the en_core_web_sm pipeline does not (instead tagging it as a plain NOUN).
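
To see the two kinds of annotation side by side, here is a minimal sketch that reads both from the same Doc. The helper names pos_and_entities and demo are hypothetical; token.pos_, doc.ents and ent.label_ are the standard spaCy attributes. Whether the entity recogniser actually labels MyBOQ as ORG depends on the model, so treat that as something to check rather than a given.

```python
def pos_and_entities(doc):
    """Collect PoS tags and named-entity labels from a spaCy Doc."""
    # Part-of-speech tags live on each token (PRON, PROPN, NOUN, ...).
    pos_tags = [(token.text, token.pos_) for token in doc]
    # Named-entity labels live on the spans in doc.ents (ORG, PERSON, ...).
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities


def demo(model_name="en_core_web_sm"):
    """Print both annotations for the sentence from the question.

    Hypothetical helper: assumes spaCy and the named model are installed.
    """
    import spacy

    nlp = spacy.load(model_name)
    pos_tags, entities = pos_and_entities(nlp("All my stuff is at to MyBOQ?"))
    print("PoS tags:", pos_tags)
    print("Entities:", entities)
```

Calling demo() prints the PoS tag for every token and, separately, whatever entities the pipeline found, so an organisation would show up in the second list rather than as a word.pos_ value.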

Answered By: Kyle F Hartzenberg