NLTK WordNet Lemmatizer: Shouldn't it lemmatize all inflections of a word?
Question:
I’m using the NLTK WordNet Lemmatizer for a Part-of-Speech tagging project by first modifying each word in the training corpus to its stem (in place modification), and then training only on the new corpus. However, I found that the lemmatizer is not functioning as I expected it to.
For example, the word loves
is lemmatized to love
which is correct, but the word loving
remains loving
even after lemmatization. Here loving
is as in the sentence “I’m loving it”.
Isn’t love
the stem of the inflected word loving
? Similarly, many other ‘ing’ forms remain as they are after lemmatization. Is this the correct behavior?
What are some other lemmatizers that are accurate? (need not be in NLTK) Are there morphology analyzers or lemmatizers that also take into account a word’s Part Of Speech tag, in deciding the word stem? For example, the word killing
should have kill
as the stem if killing
is used as a verb, but it should have killing
as the stem if it is used as a noun (as in the killing was done by xyz
).
Answers:
The WordNet lemmatizer does take the POS tag into account, but it doesn’t magically determine it:
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'
Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you’re passing it the noun “loving” (as in “sweet loving”).
The best way to troubleshoot this is to actually look in Wordnet. Take a look here: Loving in wordnet. As you can see, there is actually an adjective “loving” present in Wordnet. As a matter of fact, there is even the adverb “lovingly”: lovingly in Wordnet. Because wordnet doesn’t actually know what part of speech you actually want, it defaults to noun (‘n’ in Wordnet). If you are using Penn Treebank tag set, here’s some handy function for transforming Penn to WN tags:
from nltk.corpus import wordnet as wn
def is_noun(tag):
return tag in ['NN', 'NNS', 'NNP', 'NNPS']
def is_verb(tag):
return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
def is_adverb(tag):
return tag in ['RB', 'RBR', 'RBS']
def is_adjective(tag):
return tag in ['JJ', 'JJR', 'JJS']
def penn_to_wn(tag):
if is_adjective(tag):
return wn.ADJ
elif is_noun(tag):
return wn.NOUN
elif is_adverb(tag):
return wn.ADV
elif is_verb(tag):
return wn.VERB
return None
Hope this helps.
it’s clearer and more effective than enumeration:
from nltk.corpus import wordnet
def get_wordnet_pos(self, treebank_tag):
if treebank_tag.startswith('J'):
return wordnet.ADJ
elif treebank_tag.startswith('V'):
return wordnet.VERB
elif treebank_tag.startswith('N'):
return wordnet.NOUN
elif treebank_tag.startswith('R'):
return wordnet.ADV
else:
return ''
def penn_to_wn(tag):
return get_wordnet_pos(tag)
As an extension to the accepted answer from @Fred Foo
above;
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
lem = WordNetLemmatizer()
word = input("Enter word:t")
# Get the single character pos constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()
# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
# 'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
# 'r': ['RB', 'RBR', 'RBS'],
# 'a': ['JJ', 'JJR', 'JJS']}
if pos_label == 'j': pos_label = 'a' # 'j' <--> 'a' reassignment
if pos_label in ['r']: # For adverbs it's a bit different
print(wordnet.synset(word+'.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']: # For adjectives and verbs
print(lem.lemmatize(word, pos=pos_label))
else: # For nouns and everything else as it is the default kwarg
print(lem.lemmatize(word))
I’m using the NLTK WordNet Lemmatizer for a Part-of-Speech tagging project by first modifying each word in the training corpus to its stem (in place modification), and then training only on the new corpus. However, I found that the lemmatizer is not functioning as I expected it to.
For example, the word loves
is lemmatized to love
which is correct, but the word loving
remains loving
even after lemmatization. Here loving
is as in the sentence “I’m loving it”.
Isn’t love
the stem of the inflected word loving
? Similarly, many other ‘ing’ forms remain as they are after lemmatization. Is this the correct behavior?
What are some other lemmatizers that are accurate? (need not be in NLTK) Are there morphology analyzers or lemmatizers that also take into account a word’s Part Of Speech tag, in deciding the word stem? For example, the word killing
should have kill
as the stem if killing
is used as a verb, but it should have killing
as the stem if it is used as a noun (as in the killing was done by xyz
).
The WordNet lemmatizer does take the POS tag into account, but it doesn’t magically determine it:
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
u'love'
Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you’re passing it the noun “loving” (as in “sweet loving”).
The best way to troubleshoot this is to actually look in Wordnet. Take a look here: Loving in wordnet. As you can see, there is actually an adjective “loving” present in Wordnet. As a matter of fact, there is even the adverb “lovingly”: lovingly in Wordnet. Because wordnet doesn’t actually know what part of speech you actually want, it defaults to noun (‘n’ in Wordnet). If you are using Penn Treebank tag set, here’s some handy function for transforming Penn to WN tags:
from nltk.corpus import wordnet as wn
def is_noun(tag):
return tag in ['NN', 'NNS', 'NNP', 'NNPS']
def is_verb(tag):
return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
def is_adverb(tag):
return tag in ['RB', 'RBR', 'RBS']
def is_adjective(tag):
return tag in ['JJ', 'JJR', 'JJS']
def penn_to_wn(tag):
if is_adjective(tag):
return wn.ADJ
elif is_noun(tag):
return wn.NOUN
elif is_adverb(tag):
return wn.ADV
elif is_verb(tag):
return wn.VERB
return None
Hope this helps.
it’s clearer and more effective than enumeration:
from nltk.corpus import wordnet
def get_wordnet_pos(self, treebank_tag):
if treebank_tag.startswith('J'):
return wordnet.ADJ
elif treebank_tag.startswith('V'):
return wordnet.VERB
elif treebank_tag.startswith('N'):
return wordnet.NOUN
elif treebank_tag.startswith('R'):
return wordnet.ADV
else:
return ''
def penn_to_wn(tag):
return get_wordnet_pos(tag)
As an extension to the accepted answer from @Fred Foo
above;
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
lem = WordNetLemmatizer()
word = input("Enter word:t")
# Get the single character pos constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()
# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
# 'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
# 'r': ['RB', 'RBR', 'RBS'],
# 'a': ['JJ', 'JJR', 'JJS']}
if pos_label == 'j': pos_label = 'a' # 'j' <--> 'a' reassignment
if pos_label in ['r']: # For adverbs it's a bit different
print(wordnet.synset(word+'.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']: # For adjectives and verbs
print(lem.lemmatize(word, pos=pos_label))
else: # For nouns and everything else as it is the default kwarg
print(lem.lemmatize(word))