Pandas isin() does not return anything even when the keywords exist in the dataframe

Question:

I’d like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can’t understand why the solution is not working in my case.

keywords = ['fake', 'false', 'lie']

df1:

text
19152 I think she is the Corona Virus….
19154 Boy you hate to see that. I mean seeing how it was contained and all.
19155 Tell her it’s just the fake flu, it will go away in a few days.
19235 Is this fake news?
20540 She’ll believe it’s just alternative facts.

Expected results: I’d like to select rows that have the exact keywords in my list (‘fake’, ‘false’, ‘lie’). For example, in the above df, it should return rows 19155 and 19235.

str.contains()

df1[df1['text'].str.contains("|".join(keywords))]

The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with believe (e.g., row 20540) because lie is a substring of "believe"!

pandas.Series.isin

To find the rows including the exact keywords, I used pd.Series.isin:

df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]

Even though I see there are matches in df1, it doesn’t return anything.

Asked By: mOna

||

Answers:

I believe it’s because pd.Series.isin() checks whether each value in the column is equal to one of the strings in the list, not whether the value contains one of those strings as a word. I just tested this code snippet:

import pandas as pd

s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
               'hippo'], name='animal')

s.isin(['cow', 'lama'])

As I expected, the first string returns False even though it contains the word ‘lama’.

Maybe try using regex? See this: searching a word in the column pandas dataframe python
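
A minimal sketch of the regex idea, assuming the goal is whole-word matching: wrap the joined keywords in \b word boundaries before passing them to str.contains(). The sample rows are copied from the question (with plain apostrophes); the pattern variable and the use of re.escape are my own additions, not something from the linked post.

```python
import re
import pandas as pd

# Sample rows from the question, keeping the original index labels.
df1 = pd.DataFrame(
    {"text": [
        "I think she is the Corona Virus.",
        "Boy you hate to see that. I mean seeing how it was contained and all.",
        "Tell her it's just the fake flu, it will go away in a few days.",
        "Is this fake news?",
        "She'll believe it's just alternative facts.",
    ]},
    index=[19152, 19154, 19155, 19235, 20540],
)
keywords = ["fake", "false", "lie"]

# \b marks a word boundary, so "lie" no longer matches inside "believe".
# re.escape guards against keywords that contain regex metacharacters.
# The group is non-capturing (?:...) so str.contains does not warn about groups.
pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"

matches = df1[df1["text"].str.contains(pattern, regex=True)]
print(matches.index.tolist())  # [19155, 19235]
```

If the match should ignore case, str.contains() also accepts case=False.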

Answered By: Lucas Teixeira

If text is as follows,

import pandas as pd

df1 = pd.DataFrame()
df1['text'] = [
    "Dear Kellyanne, Please seek the help of Paula White I believe ...",
    "trump saying it was under controll was a lie, ...",
    "Her mouth should hanve been ... All the lies she has told ...",
    "she'll believe ...",
    "I do believe in ...",
    "This value is false ...",
    "This value is fake ...",
    "This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']

First,

A simple way is this:

df1[df1.text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]
                      text
5  This value is false ...
6   This value is fake ...

This no longer matches words like "believe", but it also misses "lie," because of the trailing comma.

Second,

So remove the punctuation from the text first, like this:

import re

new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]

Now it catches the word "lie,".

                                                text
1  trump saying it was under controll was a lie, ...
5                            This value is false ...
6                             This value is fake ...

Third,

It still can’t catch the word "lies". This can be solved with a library that reduces different forms of a word to the same base form (stemming or lemmatization). For how to tokenize, see: tokenize-words-in-a-list-of-sentences-python
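
To illustrate the idea without pulling in an NLP library, here is a rough sketch. The normalize() helper is hypothetical and deliberately naive (it only lowercases and drops one trailing "s"); it stands in for a real stemmer or lemmatizer and would mishandle a keyword that itself ends in "s". The rows are a subset of the sample data above.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"text": [
    "trump saying it was under controll was a lie, ...",
    "Her mouth should hanve been ... All the lies she has told ...",
    "she'll believe ...",
    "This value is fake ...",
    "This song is fakelove ...",
]})
keywords = {"misleading", "fake", "false", "lie"}

def normalize(word):
    """Hypothetical stand-in for a real stemmer: lowercase, drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def has_keyword(text):
    # Pull out runs of letters/digits, normalize each token, then test membership.
    return any(normalize(w) in keywords for w in re.findall(r"[0-9a-zA-Z]+", text))

result = df1[df1["text"].apply(has_keyword)]
print(result.index.tolist())  # [0, 1, 3]
```

Row 1 now matches through "lies", while "believe" and "fakelove" are still excluded.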

Answered By: Lazyer

I think splitting the text into words and then matching is a better and more straightforward approach, e.g. if the df and keywords are

df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']

df

       text
0  lama abc
1   cow def
2   foo bar
3  spam egg

This should return the correct result

df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]

       text
0  lama abc
2   foo bar

Explanation

First, split df['text'] into words

splits = df['text'].str.findall(r'\w+')

splits is

0    [lama, abc]
1     [cow, def]
2     [foo, bar]
3    [spam, egg]
Name: text, dtype: object

Then we need to check whether any word in a row appears in the keywords

# this is the answer for a single row, where `words` is that row's list of words
any(word in keywords for word in words)

# for the entire dataframe, build a Series; `splits` from above holds the word list for every row
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows

0     True
1    False
2     True
3    False
dtype: bool

Now we can find the correct rows with

df.loc[rows]

       text
0  lama abc
2   foo bar
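
One caveat, offered as a sketch rather than a correction: a Series built from a generator gets a fresh 0..n-1 index, while df.loc aligns a boolean Series on index labels. If df has a different index (as in the question, where the labels start at 19152), the mask may not line up with the rows. Assuming that is a concern, building the mask with .apply() on the split Series keeps the original index. The index values below are made up for the demonstration.

```python
import pandas as pd

df = pd.DataFrame(
    {"text": ["lama abc", "cow def", "foo bar", "spam egg"]},
    index=[10, 20, 30, 40],  # deliberately not 0..3
)
keywords = ["foo", "lama"]

# .apply on the Series of word lists returns a mask that shares df's index.
mask = df["text"].str.findall(r"\w+").apply(
    lambda words: any(word in keywords for word in words)
)

selected = df.loc[mask]
print(selected.index.tolist())  # [10, 30]
```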

Be aware this approach could consume much more memory as it needs to generate the split list on each line. So if you have huge data sets, this might be a problem.

Answered By: Brandon

import re

df[df.text.apply(lambda x: any(w in keywords for w in re.findall(r'\w+', x)))]

Output:

                                                text
2  Tell her it’s just the fake flu, it will go aw...
3                                 Is this fake news?
Answered By: BeRT2me