How to delete stopwords saved in a file from a dataframe

Question:

I have a .txt file containing stopwords, and I want to remove those stopwords from the sentences in a dataframe. I tried this:

f = open("stopwords.txt", "r")
stopword_list = []
for line in f:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    stopword_list.append(line_list[0])
f.close()

len(stopword_list)

tokens_without_sw = [word for word in tokenized_tweets if not word in stopword_list]
print("After stopwords removed")
print(tokens_without_sw)

But it doesn’t change anything: none of the stopwords in the list are removed.

Asked By: Zulfi A

Answers:

Similar to your other question, you can use re.sub or Series.str.replace with a regex that matches any of the words in stopword_list, surrounded by word boundaries, and replaces each match with an empty string.

I’m assuming stopword_list has already been read.
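If it has not been read yet, a minimal sketch of loading it might look like the following. It assumes the file has one stopword per line, as the question's code implies; the helper name load_stopwords is hypothetical. Unlike the loop in the question, it skips blank lines, which would otherwise raise an IndexError on line_list[0].

```python
def load_stopwords(path):
    """Return the first whitespace-separated token of each non-blank line."""
    stopwords = []
    # The context manager closes the file even if an error occurs.
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:  # skip blank or whitespace-only lines
                stopwords.append(parts[0])
    return stopwords
```

It would be called as stopword_list = load_stopwords("stopwords.txt").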

import re

stopword_list = ["tweet", "not"]

escaped_words = "|".join(re.escape(word) for word in stopword_list)
print(repr(escaped_words))
# 'tweet|not'

regex = fr"\b({escaped_words})\b"
print(repr(regex))
# '\\b(tweet|not)\\b'
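To see what the word boundaries do, here is a small sketch that applies the same kind of pattern to single strings with re.sub. The two-word stopword list and the sample strings are made up for illustration.

```python
import re

stopword_list = ["tweet", "not"]  # stand-in for the list read from the file

alternatives = "|".join(re.escape(word) for word in stopword_list)
pattern = re.compile(rf"\b({alternatives})\b", flags=re.IGNORECASE)

# Whole words are removed regardless of case...
print(repr(pattern.sub("", "This is not a Tweet")))
# 'This is  a '

# ...but stopwords embedded in longer words are left alone.
print(repr(pattern.sub("", "a notable retweet")))
# 'a notable retweet'
```

Without the \b anchors, "notable" would become "able" and "retweet" would become "re".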

Now, call Series.str.replace with case=False to do a case-insensitive match:

import pandas as pd

df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})

df['clean'] = df['tweets'].str.replace(regex, '', case=False, regex=True)

which gives:

              tweets        clean
0      this is a tweet   this is a 
1  this is not a tweet  this is  a 
2                   no           no
3        Another tweet     Another 
4    Not another tweet     another 
5            Tweet not

Note that this leaves extra whitespace wherever a word was removed. That can be cleaned up the same way the words were removed: the regex r"\s{2,}" matches two or more consecutive whitespace characters, which are replaced with a single space, and str.strip() then removes any leading or trailing whitespace.

df['clean'] = df['tweets'].str.replace(regex, '', case=False, regex=True).str.replace(r"\s{2,}", " ", regex=True).str.strip()

which gives:

                tweets      clean
0      this is a tweet  this is a
1  this is not a tweet  this is a
2                   no         no
3        Another tweet    Another
4    Not another tweet    another
5            Tweet not
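
As for why the attempt in the question removes nothing: if tokenized_tweets holds one list of tokens per tweet (an assumption, since the question does not show its contents), then each word in that comprehension is an entire list, which never equals a stopword string. A list-based sketch under that assumption filters inside each tweet instead; the sample data below is invented.

```python
# Lowercase stopwords in a set for fast membership tests.
stopword_set = {"tweet", "not"}

# Assumed shape: one list of tokens per tweet.
tokenized_tweets = [
    ["this", "is", "not", "a", "tweet"],
    ["Another", "tweet"],
]

# Filter tokens within each tweet, comparing case-insensitively.
tokens_without_sw = [
    [word for word in tweet if word.lower() not in stopword_set]
    for tweet in tokenized_tweets
]
print(tokens_without_sw)
# [['this', 'is', 'a'], ['Another']]
```
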
Answered By: Pranav Hosangadi