spaCy: how to avoid removing "not" when cleaning text with spaCy

Question:

I use this spaCy code to clean my text, but I need negation words such as "not" to stay in the text.

import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

def my_tokenizer(sentence):
    # Keep the lemmas of alphabetic tokens that are not stop words
    return [token.lemma_ for token in tqdm(nlp(sentence.lower()), leave=False) if not token.is_stop and token.is_alpha and token.lemma_]

When I apply this, I get the following result:

[hello, earphone, work]

However, the original sentence was:

hello,my earphones are still not working.

So I would like to get the following result instead: [earphone, still, not, work]
Thank you.

Asked By: Giorgi

||

Answers:

"not" is actually a stop word, and your code removes every token that is a stop word. You can see this either by checking spaCy's list of stop words

"not" in spacy.lang.en.stop_words.STOP_WORDS

or by looping over the tokens of your doc object

text = "hello,my earphones are still not working."
for tok in nlp(text.lower()):
    print(tok.text, tok.is_stop, tok.lemma_)

#hello False hello
#, False ,
#my True my
#earphones False earphone
#are True be
#still True still
#not True not
#working False work
#. False .
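
To see which other negation words are affected, the same membership check can be run over several words at once. This is only a minimal sketch: the candidates list is a hypothetical selection for illustration, not something spaCy provides, so extend it with whatever words matter for your data.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Hypothetical list of negation words to check; extend it for your own data.
candidates = ["not", "no", "never", "nor", "neither", "cannot"]

# Keep only the candidates that spaCy treats as English stop words.
negations_in_stop_words = [word for word in candidates if word in STOP_WORDS]
print(negations_in_stop_words)
```

Every word printed here would be dropped by a filter on token.is_stop, so these are the ones to take out of the stop-word list if you want to keep them.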

Solution

To solve this, remove the words you want to keep, such as "not", from the list of stop words. You can do it this way:

# spacy.lang.en.stop_words.STOP_WORDS.remove("not")
# or for multiple words use this
to_del_elements = {"not", "no"}
nlp.Defaults.stop_words = nlp.Defaults.stop_words - to_del_elements

Then rerun your code, including spacy.load so that the pipeline picks up the modified list, and you’ll get your expected results:

import spacy
from spacy.lang.en import English
from tqdm import tqdm

# Remove the negation words from the defaults before loading the pipeline
# (nlp does not exist yet at this point, so go through the English class)
to_del_elements = {"not", "no"}
English.Defaults.stop_words = English.Defaults.stop_words - to_del_elements
nlp = spacy.load("en_core_web_sm")

def my_tokenizer(sentence):
    return [token.lemma_ for token in tqdm(nlp(sentence.lower()), leave=False) if not token.is_stop and token.is_alpha and token.lemma_]

sentence = "hello,my earphones are still not working. no way they will work"
results = my_tokenizer(sentence)
print(results)

#['hello', 'earphone', 'not', 'work', 'no', 'way', 'work']
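
If you would rather not modify the defaults, another option is to switch off the is_stop flag on the vocabulary entries of an already loaded pipeline. The following is a minimal sketch of that idea, not the only way to do it. It assumes spacy.blank("en") so that it runs without downloading a model; a blank pipeline has no lemmatizer, so the sketch returns token.text rather than token.lemma_. The names NEGATIONS and keep_negations are hypothetical.

```python
import spacy

# A blank English pipeline: tokenizer and stop-word flags only, no model download.
# Assumption: is_stop behaves the same way with en_core_web_sm, since both
# use the English language defaults.
nlp = spacy.blank("en")

# Hypothetical set of negation words to keep; adjust it to your data.
NEGATIONS = {"not", "no"}

# Clear the stop-word flag on each word's vocabulary entry.
for word in NEGATIONS:
    nlp.vocab[word].is_stop = False

def keep_negations(sentence):
    """Return lowercased alphabetic tokens that are not flagged as stop words."""
    return [tok.text for tok in nlp(sentence.lower()) if not tok.is_stop and tok.is_alpha]

tokens = keep_negations("hello,my earphones are still not working.")
print(tokens)
```

The flag is set per vocabulary entry, so this only affects the exact lowercase strings in NEGATIONS. That matches the code above, which lowercases the sentence before tokenizing.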
Answered By: Hannibal