Check if each value in a dataframe column contains words from another dataframe column

Question:

How do I iterate through each value in one dataframe column and check if it contains words in another dataframe column?

a = pd.DataFrame({'text': ['the cat jumped over the hat', 'the pope pulled on the rope', 'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

a    
    text
0   the cat jumped over the hat
1   the pope pulled on the rope
2   i lost my dog in the fog

b
    dirty_words
0   cat
1   dog
2   parakeet

I want to get a new dataframe that contains only these values:

result

0   the cat jumped over the hat
1   i lost my dog in the fog
Asked By: silverSuns


Answers:

Use regex matching with str.contains.

p = '|'.join(b['dirty_words'].dropna())
a[a['text'].str.contains(r'\b(?:{})\b'.format(p))]

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog

The word boundaries ensure you won’t match “catch” just because it contains “cat” (thanks @DSM).
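To make the word-boundary behaviour concrete, here is a minimal self-contained sketch. It is not part of the original answer: the extra "catch me if you can" row, the `None` entry, and the use of `re.escape`, `case=False`, and `na=False` are assumptions added for illustration. The non-capturing group `(?:...)` makes both `\b` anchors apply to every alternative rather than only the first and last.

```python
import re
import pandas as pd

# Sample data from the question, plus one hypothetical row ("catch ...")
# that should NOT match and a missing value in the word list.
a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog',
                           'catch me if you can']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet', None]})

# Escape each word so regex metacharacters (e.g. '.', '+') are taken literally.
words = [re.escape(w) for w in b['dirty_words'].dropna()]

# \b(?:cat|dog|parakeet)\b : the group keeps both boundaries on every word.
pattern = r'\b(?:{})\b'.format('|'.join(words))

# na=False treats missing text as "no match"; case=False ignores case.
mask = a['text'].str.contains(pattern, case=False, na=False)
result = a[mask]
print(result)
```

Rows 0 and 2 are kept; the "catch" row is dropped because "cat" is not followed by a word boundary there.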

Answered By: cs95

You can use a list comprehension with any after splitting strings by whitespace. This method won’t include “catheter” just because it includes “cat”.

mask = [any(i in words for i in b['dirty_words'].values) 
        for words in a['text'].str.split().values]

print(a[mask])

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
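
A possible variation on the same idea (a sketch, not from the original answer): put the words in a `set` and test each row's tokens against it with `isdisjoint`, which avoids scanning the whole word list for every row. The names `dirty` and `result` are illustrative only.

```python
import pandas as pd

a = pd.DataFrame({'text': ['the cat jumped over the hat',
                           'the pope pulled on the rope',
                           'i lost my dog in the fog']})
b = pd.DataFrame({'dirty_words': ['cat', 'dog', 'parakeet']})

# Build the lookup set once; dropna() guards against missing words.
dirty = set(b['dirty_words'].dropna())

# A row matches when at least one whitespace-separated token is in the set.
mask = a['text'].str.split().apply(lambda tokens: not dirty.isdisjoint(tokens))
result = a[mask]
print(result)
```

Like the list comprehension, this compares whole whitespace-separated tokens, so a token with attached punctuation such as "cat," would not match "cat".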
Answered By: jpp

I think you can use isin after str.split:

a[pd.DataFrame(a.text.str.split().tolist()).isin(b.dirty_words.tolist()).any(axis=1)]

                          text
0  the cat jumped over the hat
2     i lost my dog in the fog
Answered By: BENY