Given a word, can we get all possible lemmas for it using spaCy?

Question:

The input word is standalone and not part of a sentence, but I would like to get all of its possible lemmas as if the input word appeared in different sentences with all possible POS tags. I would also like to get the lookup version of the word's lemma.

Why am I doing this?

I have extracted lemmas from all the documents and I have also calculated the number of dependency links between lemmas; I have done both using the en_core_web_sm model. Now, given an input word, I would like to return the lemmas that are linked most frequently to all the possible lemmas of the input word.

So in short, I would like to replicate the behaviour of token.lemma_ for the input word with all possible POS tags, to maintain consistency with the lemma links I have counted.
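The aggregation step described above can be sketched with plain Python; the link counts and lemma names below are hypothetical placeholders, not real corpus data:

```python
from collections import Counter

# Hypothetical link counts gathered earlier: (lemma, linked_lemma) -> frequency.
link_counts = Counter({("watch", "wrist"): 5, ("watch", "movie"): 3, ("see", "movie"): 2})

def most_linked(candidate_lemmas, counts, top_n=3):
    """Merge link counts over every candidate lemma of the input word."""
    merged = Counter()
    for (lemma, linked), n in counts.items():
        if lemma in candidate_lemmas:
            merged[linked] += n
    return merged.most_common(top_n)

# If "watches" lemmatizes to {"watch"} under both its NOUN and VERB readings:
print(most_linked({"watch"}, link_counts))
```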

Asked By: PSK


Answers:

I found it difficult to get lemmas and inflections directly out of spaCy without first constructing an example sentence to give it context. This wasn't ideal, so I looked further and found that LemmInflect did this very well.

>>> from lemminflect import getAllLemmas, getInflection, getAllInflections, getAllInflectionsOOV

>>> getAllLemmas('watches')
{'NOUN': ('watch',), 'VERB': ('watch',)}

>>> getAllInflections('watch')
{'NN': ('watch',), 'NNS': ('watches', 'watch'), 'VB': ('watch',), 'VBD': ('watched',), 'VBG': ('watching',), 'VBZ': ('watches',), 'VBP': ('watch',)}
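Since getAllLemmas keys its results by POS tag, the POS-keyed dict can be flattened into a single set of candidate lemmas with a comprehension. The dict literal below simply copies the output shown above, so lemminflect itself is not needed to run this sketch:

```python
# Output of getAllLemmas('watches'), copied from the example above.
all_lemmas = {'NOUN': ('watch',), 'VERB': ('watch',)}

# Flatten the POS-keyed tuples into one set of candidate lemmas.
candidates = {lemma for forms in all_lemmas.values() for lemma in forms}
print(candidates)  # -> {'watch'}
```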
Answered By: jitters

spaCy is just not designed to do this – it’s made for analyzing text, not producing text.

The linked library looks good, but if you want to stick with spaCy or need languages besides English, you can look at spacy-lookups-data, which is the raw data used for lemmas. Generally there will be a dictionary for each part of speech that lets you look up the lemma for a word.
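As a rough illustration of the format, each lookup table in spacy-lookups-data is a flat JSON object mapping a surface form to its lemma. A minimal stand-in (with made-up entries, not copied from the real data files) might look like this:

```python
import json

# Illustrative stand-in for a lemma lookup table; the real tables ship as
# JSON files inside the spacy-lookups-data package.
table_json = '{"watches": "watch", "watched": "watch", "feet": "foot"}'
lemma_lookup = json.loads(table_json)

def lookup_lemma(word, table):
    """Return the table's lemma for word, falling back to the word itself."""
    return table.get(word, word)

print(lookup_lemma("feet", lemma_lookup))     # -> foot
print(lookup_lemma("running", lemma_lookup))  # -> running (no entry in the table)
```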

Answered By: polm23

To get alternative lemmas, I am trying a combination of spaCy's rule_lemmatize and the spaCy lookup data. rule_lemmatize may produce more than one valid lemma, whereas the lookup data will only offer one lemma for a given word (in the files I have inspected). There are, however, cases where the lookup data produces a lemma whilst rule_lemmatize does not.

My examples are in Spanish:

import spacy
import spacy_lookups_data

import json
import pathlib

# text = "fui"
text = "seguid"
# text = "contenta"
print("Input text: tt" + text)

# Find lemmas using rules:
nlp = spacy.load("es_core_news_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
doc = nlp(text)
rule_lemmas = lemmatizer.rule_lemmatize(doc[0])
print("Lemmas using rules: " + ", ".join(rule_lemmas))

# Find lemma using lookup:
lookups_path = pathlib.Path(spacy_lookups_data.__file__).parent / "data" / "es_lemma_lookup.json"
with open(lookups_path, "r", encoding="utf-8") as file_object:
    lookup = json.load(file_object)
print("Lemma from lookup:\t" + lookup[text])

Output:

Input text:         fui        # I went; I was (two verbs with same form in this tense)
Lemmas using rules: ir, ser    # to go, to be (both possible lemmas returned)
Lemma from lookup:  ser        # to be

Input text:         seguid     # Follow! (imperative)
Lemmas using rules: seguid     # Follow! (lemma not returned) 
Lemma from lookup:  seguir     # to follow

Input text:         contenta   # (it) satisfies (verb); contented (adjective) 
Lemmas using rules: contentar  # to satisfy (verb but not adjective lemma returned)
Lemma from lookup:  contento   # contented (adjective, lemma form)
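A small helper can merge the two sources into one candidate list. This is a stdlib-only sketch of the combination strategy described above (no spaCy required to run it), using the outputs shown in the examples:

```python
def candidate_lemmas(rule_lemmas, lookup_lemma):
    """Union of the rule-based lemmas and the lookup lemma, order preserved."""
    seen = []
    for lemma in list(rule_lemmas) + [lookup_lemma]:
        if lemma not in seen:
            seen.append(lemma)
    return seen

# "seguid": the rules return the word unchanged, the lookup finds the lemma.
print(candidate_lemmas(["seguid"], "seguir"))  # -> ['seguid', 'seguir']
# "fui": the rules already return both possible lemmas.
print(candidate_lemmas(["ir", "ser"], "ser"))  # -> ['ir', 'ser']
```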
Answered By: neilt17