Python NLP processing if statement not in stop words list
Question:
I'm working with the spaCy NLP library and I created a function to return a list of tokens from a text.
import spacy
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)
This function is not correct because removing stop words is not working.
Everything works only if I delete the last condition, "and not in stop_words".
How can I change this function so that it removes stop words according to a defined list, in addition to all the other conditions?
Answers:
I think "not in stop_words" is meant to be a boolean test. What type is your stop_words?
If stop_words is a list, "not in stop_words" with nothing on its left-hand side is a syntax error.
You are writing your condition wrong. Your last elif is equivalent to this:
condC = not in stop_words
elif condA and condB and condC:
    ...
If you try to execute this code you will get a syntax error. To check whether an element is in an iterable, you need to put that element on the left side of the keyword in. You just have to write word there:
elif condA and condB and ... and str(word) not in stop_words:
    ...
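As a small illustration of the point above, here is a sketch in plain Python (no spaCy needed) showing that the membership test only parses when it has a left operand. The helper name is_valid_expression and the variable names are made up for this example. It also shows that the test is case-sensitive, which matters here because str(word) on a spaCy token gives the original text, while word.lower_ gives the lowercased form.

```python
stop_words = ["a", "the", "is", "are"]


def is_valid_expression(source):
    """Return True if `source` parses as a Python expression (hypothetical helper)."""
    try:
        # compile() only parses here; the names are never looked up or evaluated.
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# No left operand: rejected by the parser, whatever type stop_words has.
broken_ok = is_valid_expression("flag and not in stop_words")
# With a left operand, the same test is valid Python.
fixed_ok = is_valid_expression("flag and word not in stop_words")

word = "the"
keep = word not in stop_words  # False, because "the" is in the list

# Membership compares strings exactly, so capitalisation matters.
keep_capitalised = "The" not in stop_words         # True: "The" != "the"
keep_lowercased = "The".lower() not in stop_words  # False

print(broken_ok, fixed_ok, keep, keep_capitalised, keep_lowercased)
```

Because of the case-sensitivity, comparing a lowercased form against a lowercase stop list is probably the safer choice if sentence-initial words such as "The" should also be removed.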
You need to add stop_words as a parameter of the function, so that it takes a list of stop words as input. Then modify the condition for adding words to the token list so that it checks whether the word is in the stop_words list:
import spacy

def preprocess_text_spacy(text, stop_words):
    nlp = spacy.load('en_core_web_sm')
    tokens = []
    doc = nlp(text)
    for word in doc:
        if word.is_currency:
            tokens.append(word.lower_)
        elif len(word.lower_) == 1:
            # single-character tokens are dropped, except the digit 0
            if word.is_digit and float(word.text) == 0:
                tokens.append(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and word.lower_ not in stop_words:
            tokens.append(word.lower_)
    return tokens
Sample:
text = "This is a sample text to demonstrate the function."
stop_words = ["a", "the", "is", "are"]
tokens = preprocess_text_spacy(text, stop_words)
print(tokens)
Output:
['this', 'sample', 'text', 'to', 'demonstrate', 'function']
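One difference from the question's code is that this version returns a list, so repeated words appear more than once, whereas the original collected tokens in a set. If unique tokens in their original order are wanted, something along these lines might work. This is only a sketch on plain strings rather than spaCy tokens, and filter_stop_words is a hypothetical helper, not part of spaCy.

```python
def filter_stop_words(words, stop_words):
    """Lowercase `words`, drop stop words, and keep first occurrences in order."""
    # A set gives fast membership tests; lowercasing makes the match case-insensitive.
    stop_set = {s.lower() for s in stop_words}
    kept = [w.lower() for w in words if w.lower() not in stop_set]
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(kept))


words = ["This", "is", "a", "sample", "text", "the", "Sample", "text"]
result = filter_stop_words(words, ["a", "the", "is", "are"])
print(result)  # ['this', 'sample', 'text']
```

The same two ideas (a set for the stop words, dict.fromkeys for de-duplication) could be applied to the list returned by preprocess_text_spacy.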
Your code looks mostly fine to me; it needs one small change: at the end of the last elif, put "and str(word) not in stop_words".
import spacy
def preprocess_text_spacy(text):
    stop_words = ["a", "the", "is", "are"]
    nlp = spacy.load('en_core_web_sm')
    tokens = set()
    doc = nlp(text)
    print(doc)
    for word in doc:
        if word.is_currency:
            tokens.add(word.lower_)
        elif len(word.lower_) == 1:
            if word.is_digit and float(word.text) == 0:
                tokens.add(word.text)
        elif not word.is_punct and not word.is_space and not word.is_quote and not word.is_bracket and str(word) not in stop_words:
            tokens.add(word.lower_)
    return list(tokens)