Is there a way to get entire constituents using spaCy?

Question:

I guess I’m trying to navigate spaCy’s parse tree in a blunter way than the API seems to provide.

For instance, if I have sentences like: “He was a genius” or “The dog was green,” I want to be able to save the objects to variables (“a genius” and “green”).

token.children provides only the IMMEDIATE syntactic dependents: in the first example, the children of “was” are “He,” “genius,” and the final period, and “a” is in turn a child of “genius.” This isn’t so helpful if I just want the entire constituent “a genius,” and I’m not sure how to reconstruct it from token.children or whether there’s a better way.
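
The best I’ve come up with is recursively collecting descendants through token.children, roughly like the sketch below (the constituent helper is just something I made up), but that feels like reinventing something spaCy probably already provides:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("He was a genius.")

def constituent(token):
    # Gather the token plus all of its descendants by walking
    # token.children recursively, then restore document order via token.i.
    tokens = [token]
    for child in token.children:
        tokens.extend(constituent(child))
    return sorted(tokens, key=lambda t: t.i)

print([t.text for t in constituent(doc[3])])  # doc[3] is "genius" -> ['a', 'genius']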

I can figure out how to match “is” and “was” using token.text (part of what I’m trying to do), but I can’t figure out how to return the whole constituent “a genius” using the info provided about children.

import spacy
nlp = spacy.load('en_core_web_sm')

sent = nlp("He was a genius.")

for token in sent:
    print(token.text, token.tag_, token.dep_, [child for child in token.children])

This is the output:

He PRP nsubj []
was VBD ROOT [He, genius, .]
a DT det []
genius NN attr [a]
. . punct []

Asked By: Will


Answers:

You can use Token.subtree (see the docs) to get all dependents of a given node in the dependency tree.

For example, to get all noun phrases:

import spacy

nlp = spacy.load('en_core_web_sm')

text = "He was a genius of the best kind and his dog was green."

for token in nlp(text):
    if token.pos_ in ['NOUN', 'ADJ']:
        if token.dep_ in ['attr', 'acomp'] and token.head.lemma_ == 'be':
            # to test for only verb forms 'is' and 'was' use token.head.lower_ in ['is', 'was']
            print([t.text for t in token.subtree])

Outputs:

['a', 'genius', 'of', 'the', 'best', 'kind']
['green']
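
Note that token.subtree yields Token objects in document order. If you would rather have the constituent as a single Span (with .text and friends), you can equivalently slice the Doc using the token’s left_edge and right_edge attributes; a minimal sketch of the same filter:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("He was a genius of the best kind and his dog was green.")

for token in doc:
    if token.dep_ in ['attr', 'acomp'] and token.head.lemma_ == 'be':
        # left_edge/right_edge are the first and last tokens of the
        # subtree, so this slice covers exactly the constituent
        span = doc[token.left_edge.i : token.right_edge.i + 1]
        print(span.text)  # "a genius of the best kind", then "green"
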
Answered By: ongenz

There is a library called constituent-treelib that builds on benepar, spaCy, and NLTK and provides a simple way to access all the constituents of a given sentence. The following steps walk you through it:

# First, install the library via: pip install constituent-treelib

from constituent_treelib import ConstituentTree

# Define the sentence from where we want to extract the constituents
sentence = "He was a genius."

# Define the language that should be considered with respect to the underlying benepar and spaCy models 
language = ConstituentTree.Language.English

# You can also specify the desired spaCy model for the language ("Small" is selected by default)
spacy_model_size = ConstituentTree.SpacyModelSize.Large

# Create the necessary NLP pipeline that is required to instantiate a ConstituentTree object
nlp = ConstituentTree.create_pipeline(language, spacy_model_size) 

# If you wish, you can instruct the library to download and install the models automatically
# nlp = ConstituentTree.create_pipeline(language, spacy_model_size, download_models=True) 

# Now we can instantiate a ConstituentTree object and pass it the parsed sentence as well as the NLP pipeline
tree = ConstituentTree(sentence, nlp)

# Finally, we can extract all constituents from the tree  
all_phrases = tree.extract_all_phrases() 

Output:

{'S': ['He was a genius .'],
 'NP': ['a genius'],
 'VP': ['was a genius']}
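
As the output above shows, extract_all_phrases() returns an ordinary dict keyed by phrase label, so pulling out a particular constituent type afterwards is a plain lookup:

# Noun phrases only, continuing from the example above
print(all_phrases.get('NP', []))  # ['a genius']
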
Answered By: NeuroMorphing