how to delete stopwords saved in a file from a dataframe
Question:
I have a .txt file containing stopwords, and I want to remove those stopwords from my sentences in a dataframe. I tried doing this:
f = open("stopwords.txt", "r")
stopword_list = []
for line in f:
    stripped_line = line.strip()
    line_list = stripped_line.split()
    stopword_list.append(line_list[0])
f.close()
len(stopword_list)
tokens_without_sw = [word for word in tokenized_tweets if not word in stopword_list]
print("After stopwords removed")
print(tokens_without_sw)
but it doesn't change anything; the stopwords in the list are not removed.
Answers:
Similar to your other question, you can use re.sub or Series.str.replace with a regex that looks for any of the words in your stopword_list, surrounded by word boundaries, and replaces them with nothing. I'm assuming stopword_list has already been read.
import re

stopword_list = ["tweet", "not"]
escaped_words = "|".join(re.escape(word) for word in stopword_list)
print(repr(escaped_words))
# 'tweet|not'
regex = fr"\b({escaped_words})\b"
print(repr(regex))
# '\\b(tweet|not)\\b'
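Since re.sub was mentioned as an alternative, here is a minimal sketch of the same pattern applied to a plain Python string; re.IGNORECASE plays the role that case=False plays in pandas:

```python
import re

stopword_list = ["tweet", "not"]
escaped_words = "|".join(re.escape(word) for word in stopword_list)
regex = fr"\b({escaped_words})\b"

# Remove stopwords case-insensitively, then collapse leftover spaces
cleaned = re.sub(regex, "", "Not another tweet", flags=re.IGNORECASE)
cleaned = re.sub(r"\s{2,}", " ", cleaned).strip()
print(cleaned)  # another
```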
Now, call Series.str.replace with case=False to do a case-insensitive match:
import pandas as pd

df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})
df['clean'] = df['tweets'].str.replace(regex, '', case=False, regex=True)
which gives:
                tweets      clean
0      this is a tweet  this is a
1  this is not a tweet  this is a
2                   no         no
3        Another tweet    Another
4    Not another tweet    another
5            Tweet not
Note that this leaves two spaces where a word was removed. These are easy to remove, just like we removed the words: in this case, the regex is simply r"\s{2,}", which looks for two or more consecutive whitespace characters.
df['clean'] = df['tweets'].str.replace(regex, '', case=False, regex=True).str.replace(r"\s{2,}", " ", regex=True).str.strip()
                tweets      clean
0      this is a tweet  this is a
1  this is not a tweet  this is a
2                   no         no
3        Another tweet    Another
4    Not another tweet    another
5            Tweet not