Python NLP processing if statement not in stop words list

Question:

I’m working with the spaCy NLP library, and I wrote a function that returns a list of tokens from a text.

import spacy    
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)

This function does not work: the stop word check in the last condition fails.
Everything works only if I delete the last part of that condition, "and not in stop_words".

How can I change this function so that it removes stop words from a defined list, in addition to all the other conditions?

Asked By: Lilly_Co

Answers:

I think the problem is "not in stop_words": it has no value on its left side, so it is not a valid boolean expression.
That is a syntax error whatever type stop_words is (here it is a list).
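
A quick way to confirm this is to ask Python to parse both forms of the expression without running them. The sketch below uses the built-in compile(); the helper name is_valid_expression is made up for this illustration.

```python
def is_valid_expression(source):
    """Return True if `source` parses as a Python expression."""
    try:
        # compile() only parses the text; it never looks up stop_words
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

# No left operand: this is the form used in the question
incomplete = is_valid_expression("not in stop_words")
# With a left operand the expression parses
complete = is_valid_expression("'the' not in stop_words")
print(incomplete, complete)
```

The first check should report a syntax error and the second should parse, which matches the behaviour described in the question.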

Answered By: 007

Your condition is written incorrectly. Your last elif has this form:

elif condA and condB and not in stop_words:
    ...

If you try to execute this code, you will get a syntax error. To check whether an element is in an iterable, you need to put that element on the left side of the keyword in. Here, that element is the token's text:

elif condA and condB and ... and str(word) not in stop_words:
   ...
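
One thing to watch with str(word): it gives the token's original text, so the comparison is case-sensitive. The sketch below uses plain strings in place of spaCy tokens (an assumption, so that it runs without a model) to show why comparing the lowercased text, which spaCy exposes as word.lower_, is usually the safer choice with a lowercase stop word list.

```python
stop_words = ["a", "the", "is", "are"]

# Plain strings stand in for the text of spaCy tokens here.
words = ["The", "cat", "Is", "here"]

# Case-sensitive check: "The" and "Is" are not in the lowercase list,
# so they slip through.
kept_as_written = [w for w in words if w not in stop_words]

# Lowercase first, as word.lower_ would for a spaCy token.
kept_lowercased = [w.lower() for w in words if w.lower() not in stop_words]

print(kept_as_written)
print(kept_lowercased)
```

With the lowercased comparison, capitalised stop words at the start of a sentence are filtered out as well.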
Answered By: Jorge Luis

Pass stop_words to the function as a parameter holding the list of stop words, then change the last condition so that it checks whether the lowercased token is in stop_words:

import spacy

def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            # Keep currency symbols
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            # Of the single-character tokens, keep only the digit 0
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        # word.lower_ is the left operand that "not in" was missing
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens

Sample:

text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)

Output:

['this', 'sample', 'text', 'to', 'demonstrate', 'function']
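
Note that this version collects tokens in a list, so a repeated word appears more than once, whereas the function in the question used a set and returned each token once, in arbitrary order. If unique tokens in first-seen order are wanted, one option is dict.fromkeys. This is only a sketch on a plain list of strings standing in for the function's output; the helper name unique_in_order is hypothetical.

```python
def unique_in_order(tokens):
    """Drop repeated tokens while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(tokens))

# Stand-in for a token list that contains repeats
tokens = ["this", "sample", "text", "this", "text", "function"]
deduplicated = unique_in_order(tokens)
print(deduplicated)
```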
Answered By: Abdulmajeed

Your code is almost right; it needs one small change.

At the end of the elif condition, put "and str(word) not in stop_words":

import spacy    
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
Answered By: God Is One