Filter pandas column by list of phrases

Question

I have a string column of narratives. Each narrative is basically an essay. I want to take a subset of the df where certain phrases exist. The current method isn’t working as intended: the filter returns rows that don’t contain a phrase exactly, or that contain only part of a phrase.

I’ve tried the following:

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']
df['text'].str.contains(r'\b{}\b'.format('|'.join(phrase)), re.IGNORECASE, regex=True)

Not including an example because really just looking for a code review more than anything. The method above should look through the column text to see if those phrases exist, correct? Or am I missing something?

Asked By: chicagobeast12

Answer 1

That won’t work because you did not group the alternatives.
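To see why the grouping matters, here is a minimal sketch using only Python's re module. The sample sentence is made up for illustration. Without a group, the leading \b applies only to the first alternative and the trailing \b only to the last, so the middle phrase is matched with no boundaries at all.

```python
import re

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']

# Ungrouped: \b binds only to the first and last alternatives.
ungrouped = re.compile(r'\b{}\b'.format('|'.join(phrase)), re.IGNORECASE)
# Grouped: both \b anchors apply to every alternative.
grouped = re.compile(r'\b(?:{})\b'.format('|'.join(phrase)), re.IGNORECASE)

# Hypothetical text that contains 'corner of the street' only as part of 'streetcar'.
text = 'He stood at the corner of the streetcar line.'

ungrouped_hit = ungrouped.search(text) is not None  # partial-word match slips through
grouped_hit = grouped.search(text) is not None      # rejected by the trailing \b
print(ungrouped_hit, grouped_hit)
```

The ungrouped pattern reports a hit on the partial word, which is the kind of false positive described in the question.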

To do this right, you could also sort the phrases by length in descending order, although here, with contains, it is not important:

df['text'].str.contains(r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True))), case=False, regex=True)

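I also recommend case=False instead of re.IGNORECASE. Passed positionally, as in the question, re.IGNORECASE lands in the case parameter rather than in flags, so it does not make the match case-insensitive.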
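As a quick check, the sketch below compares the two keyword spellings that should give case-insensitive matching in Series.str.contains, namely case=False and flags=re.IGNORECASE. The sample Series is hypothetical, and the example assumes the documented parameter order (pat, case, flags, na, regex).

```python
import re
import pandas as pd

# Hypothetical sample data for illustration only.
s = pd.Series(['FBI Most Wanted list', 'nothing here'])

pattern = r'\bfbi most wanted\b'

# Keyword arguments make the intent explicit and avoid positional mix-ups.
by_case = s.str.contains(pattern, case=False, regex=True)
by_flags = s.str.contains(pattern, flags=re.IGNORECASE, regex=True)

print(by_case.tolist())
print(by_flags.tolist())
```

Either spelling works as long as the argument is passed by keyword; case=False is simply the shorter of the two.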

A foolproof version:

df['text'].str.contains(r'(?!\B\w)(?:{})(?!\B\w)'.format('|'.join(sorted(map(re.escape, phrase), key=len, reverse=True))), case=False, regex=True)

where

phrases are escaped for use in regex (re.escape)
phrases are sorted by length, longest first
case=False ensures case-insensitive matching
(?!\B\w) defines adaptive word boundaries and ensures a correct whole-word match
(?:...) is a non-capturing group that groups patterns without capturing them (and causes no warnings in Series.str.contains).
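Putting the pieces together, here is a minimal end-to-end sketch of the filter. The DataFrame contents are hypothetical, since the question did not include sample data, and the column is created with object dtype on the assumption that matching should go through Python's re module, which supports lookaheads.

```python
import re
import pandas as pd

# Hypothetical sample data; the original question did not include any.
df = pd.DataFrame({'text': [
    'She Went to the Store to Buy an Apple yesterday.',
    'He waited at the corner of the streetcar stop.',
    'The poster said FBI Most Wanted.',
    'Nothing relevant here.',
]}, dtype=object)

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']

# Escape each phrase, then sort longest first.
alternatives = '|'.join(sorted(map(re.escape, phrase), key=len, reverse=True))
# Adaptive word boundaries around a non-capturing group of alternatives.
pattern = r'(?!\B\w)(?:{})(?!\B\w)'.format(alternatives)

mask = df['text'].str.contains(pattern, case=False, regex=True)
subset = df[mask]  # rows containing at least one whole phrase
print(subset)
```

The second row is excluded because 'corner of the street' appears there only as part of 'streetcar', while the first and third rows match regardless of letter case.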

Answered By: Wiktor Stribiżew
