Filter pandas column by list of phrases

Question:

I have a string column of narratives. Each narrative is basically an essay. I want to take a subset of the df where certain phrases exist. The current method isn’t working as intended: it keeps rows that don’t contain the phrase exactly, or that contain only part of a phrase.

I’ve tried the following:

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']
df['text'].str.contains(r'\b{}\b'.format('|'.join(phrase)), re.IGNORECASE, regex=True)

Not including an example because really just looking for a code review more than anything. The method above should look through the column text to see if those phrases exist, correct? Or am I missing something?

Asked By: chicagobeast12

Answers:

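That won’t work because you did not group the alternatives: without a group, the leading \b applies only to the first phrase and the trailing \b only to the last one.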
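To see the effect of the missing group, here is a minimal sketch using plain re on two made-up sentences (the sample strings are my own assumptions, not data from the question). The ungrouped pattern matches a phrase that is only part of a longer word, while the grouped one does not:

```python
import re

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']

# Ungrouped: \b binds only to the first and the last alternative.
ungrouped = re.compile(r'\b{}\b'.format('|'.join(phrase)), re.IGNORECASE)
# Grouped: both \b anchors apply to every alternative.
grouped = re.compile(r'\b(?:{})\b'.format('|'.join(phrase)), re.IGNORECASE)

# Hypothetical sample sentences for illustration only.
partial = 'He stood at the corner of the streetcar line.'
exact = 'He stood at the corner of the street.'

ungrouped_hits_partial = ungrouped.search(partial) is not None
grouped_hits_partial = grouped.search(partial) is not None
grouped_hits_exact = grouped.search(exact) is not None
```

Here the middle alternative has no boundary at all in the ungrouped pattern, so "corner of the street" is found inside "corner of the streetcar".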

To do this right, you could also sort the phrases by length in descending order, although here, with contains, it does not matter:

df['text'].str.contains(r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True))), case=False, regex=True)

I also recommend case=False instead of re.IGNORECASE: the second positional parameter of Series.str.contains is case, not flags.
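As a rough end-to-end sketch of the grouped version, assuming a small made-up DataFrame (the rows below are hypothetical), the mask can be built with case=False or, equivalently, by passing the flag by keyword as flags=re.IGNORECASE:

```python
import re
import pandas as pd

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({'text': [
    'She is on the FBI Most Wanted list.',
    'He waited at the corner of the streetcar line.',
    'Nothing relevant here.',
]})

# Grouped alternation, longest phrase first.
pattern = r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True)))

# Two keyword-based ways to request case-insensitive matching.
by_case = df['text'].str.contains(pattern, case=False, regex=True)
by_flags = df['text'].str.contains(pattern, flags=re.IGNORECASE, regex=True)

# Boolean indexing keeps only the rows where a whole phrase was found.
subset = df[by_case]
```

Only the first row should survive: the second contains "corner of the street" only as part of "streetcar", and the third contains none of the phrases.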

A foolproof version:

df['text'].str.contains(r'(?!\B\w)(?:{})(?!\B\w)'.format('|'.join(sorted(map(re.escape, phrase), key=len, reverse=True))), case=False, regex=True)

where

  • phrases are escaped for use in regex
  • phrases are sorted by length, longest first
  • case=False ensures case-insensitive matching
  • (?!\B\w) defines adaptive word boundaries and ensures a correct whole-word match: a boundary is required only on a side where the phrase starts or ends with a word character
  • (?:...) is a non-capturing group that groups patterns without capturing them (and causes no warnings in Series.str.contains).
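The difference between \b and the adaptive boundaries only shows up when a phrase starts or ends with a non-word character. A small sketch with plain re, assuming a hypothetical phrase list that includes 'c++' (not one of the phrases in the question) and a helper name build_pattern that I made up for illustration:

```python
import re

def build_pattern(phrases):
    """Escape, sort longest first, and wrap in adaptive word boundaries."""
    escaped = sorted(map(re.escape, phrases), key=len, reverse=True)
    return re.compile(r'(?!\B\w)(?:{})(?!\B\w)'.format('|'.join(escaped)), re.IGNORECASE)

# Hypothetical phrases; 'c++' ends with non-word characters.
phrases = ['c++', 'fbi most wanted']

adaptive = build_pattern(phrases)
# Same escaped phrases, but with ordinary \b boundaries for comparison.
plain = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, phrases))), re.IGNORECASE)

adaptive_finds_cpp = adaptive.search('I write C++ daily') is not None
plain_finds_cpp = plain.search('I write C++ daily') is not None
adaptive_inside_word = adaptive.search('abc++ is something else') is not None
adaptive_plural = adaptive.search('the FBI most wanteds list') is not None
```

With \b, the phrase 'c++' is missed because there is no word boundary between "+" and the following space. The adaptive version finds it, and still rejects matches that start or end in the middle of a word.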
Answered By: Wiktor Stribiżew