Count words in a sentence controlling for negations

Question:

I am trying to count the number of times certain words occur in a sentence while controlling for negations. In the example below, I wrote some very basic code that counts the number of times the words in "w" appear in "txt". However, I fail to control for negations like "don't" and/or "not".

w = ["hello", "apple"]
txt = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

for word in w:
    print(txt.count(word))

The code should say that it finds "apple" only 2 times, not 4. So I would like to add: if, within n words before or after a word in "w", there is a negation, then don't count that occurrence; otherwise, count it.

N.B. Negations here are words like "don’t" and "not".

Can anyone help me with this?

Thanks a lot for your help!

Asked By: Rollo99


Answers:

Firstly, before you consider the negations/negatives, str.count might not be doing what you’re expecting.

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

text.count('apple') # Outputs: 4

But if you do:

text = "The thief grappled the pineapples and ran away with a basket of apples"

text.count('apple') # Outputs: 3
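If substring matches are the problem, one quick fix (a sketch, not part of the original answer) is a regex with word boundaries instead of str.count:

```python
import re

text = "The thief grappled the pineapples and ran away with a basket of apples"

# \b anchors the match at word boundaries, so "grappled" and "pineapples"
# no longer match; "apples?" allows an optional plural "s"
len(re.findall(r"\bapples?\b", text))  # -> 1
```

This only handles the substring issue, though; it still cannot tell a negated mention from a plain one.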

If you want to count the words, you would need to do some tokenization first to change the string into a list of strings, e.g.

from collections import Counter

import nltk
from nltk import word_tokenize

nltk.download('punkt')

text = "The thief grappled the pineapples and ran away with a basket of apples"

Counter(word_tokenize(text))['apple'] # Output: 0
Counter(word_tokenize(text))['apples'] # Output: 1

Then you would need to ask yourself whether plurals matter when you count the number of times apple/apples occurs. If so, you would have to do some stemming or lemmatization, see Stemmers vs Lemmatizers
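For instance, a stemmer maps both forms onto one common stem (which need not be a dictionary word). A minimal sketch with NLTK's PorterStemmer, which needs no extra data downloads:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The singular and the plural reduce to the same stem, so counting
# stems treats "apple" and "apples" as one word.
stemmer.stem("apple") == stemmer.stem("apples")  # -> True
```

A lemmatizer (e.g. NLTK's WordNetLemmatizer, which requires downloading the wordnet data) would instead return the dictionary form "apple".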

This tutorial might be helpful: https://www.kaggle.com/code/alvations/basic-nlp-with-nltk


Assuming that you adopt lemmas and tokenizers, and decide however you need to define what a "word" is and how to count them, you then have to define what negation is and what you ultimately want to do with the counts.

Let's go with:

I want to break the text down into "chunks" or clauses that have positive and negative sentiment towards some object/nouns.

Then you would have to define what negative/positive means; in the simplest terms you might say

any negation word that comes within the window of the focus noun makes the mention "negative", and in any other case it is positive.

And if we try to code up this simplest quantification of negation, you would first have to

  • identify the focus word, let's take the word apple, and
  • then the window, let's say a sliding window of 5 tokens.

In code:

import nltk
from nltk import word_tokenize, ngrams

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

# NLTK tokenizes "don't" into "do" + "n't", so the negations are listed
# in their tokenized form; a multi-token string like "do not" would
# never match a single token.
NEGATIVE_WORDS = ["n't", "not"]

def count_negation(tokens):
    return sum(1 for word in tokens if word in NEGATIVE_WORDS)

for window in ngrams(word_tokenize(text), 5):
    if "apple" in window or "apples" in window:
        print(count_negation(window), window)

[out]:

0 ('I', 'love', 'apples', ',', 'apple')
0 ('love', 'apples', ',', 'apple', 'are')
0 ('apples', ',', 'apple', 'are', 'my')
0 (',', 'apple', 'are', 'my', 'favorite')
0 ('apple', 'are', 'my', 'favorite', 'fruit')
1 ('do', "n't", 'really', 'like', 'apples')
1 ("n't", 'really', 'like', 'apples', 'if')
0 ('really', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'too')
1 ('I', 'do', 'not', 'like', 'apples')
1 ('do', 'not', 'like', 'apples', 'if')
1 ('not', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'immature')

Q: But isn't that kind of over-counting, when "I do not like apples" gets counted 3 times even though the sentence/clause appears only once in the text?

Yes, it is over-counting, so it goes back to the question of what the ultimate goal of counting the negations is.
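One way to avoid the over-counting (a sketch, not part of the original answer; `count_unnegated` and the window size of 5 are illustrative choices) is to visit each occurrence of the focus word once and check a window around it for negations, instead of scoring every n-gram. For brevity this uses naive whitespace tokenization, so the negation list contains "don't" rather than the NLTK tokens "do" + "n't":

```python
text = ("I love apples, apple are my favorite fruit. "
        "I don't really like apples if they are too mature. "
        "I do not like apples if they are immature either.")

NEGATIVE_WORDS = {"don't", "not"}

def count_unnegated(tokens, focus, window=5):
    """Count occurrences of focus words with no negation within `window` tokens."""
    count = 0
    for i, token in enumerate(tokens):
        if token in focus:
            context = tokens[max(0, i - window): i + window + 1]
            if not any(t in NEGATIVE_WORDS for t in context):
                count += 1
    return count

# naive whitespace tokenization; strip trailing punctuation
tokens = [t.strip(".,") for t in text.split()]
count_unnegated(tokens, focus={"apple", "apples"})  # -> 2
```

Each of the 4 occurrences is now counted at most once, and the 2 negated ones are skipped.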

If the ultimate goal is to have a sentiment classifier then I think lexical approaches might not be as good as state-of-the-art language models, like:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."


prompt = f"""Do I like apples or not?
QUERY:{text}
OPTIONS:
 - Yes, I like apples
 - No, I hate apples
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)

[out]:

Yes, I like apples

Q: But what if I want to explain why the model assumes positive/negative sentiments towards apple? How can I do it without counting negations?

A: Good point. Explaining model outputs is an active research area, so there's definitely no clear answer yet, but take a look at https://aclanthology.org/2022.coling-1.406

Answered By: alvas