Filter pandas column by list of phrases
Question:
I have a string column of narratives. Each narrative is basically an essay. I want to take a subset of the df where certain phrases exist. The current method isn’t working as intended: it keeps rows that don’t contain a phrase exactly, or that contain only part of a phrase.
I’ve tried the following:
phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']
df['text'].str.contains(r'\b{}\b'.format('|'.join(phrase)), re.IGNORECASE, regex=True)
Not including an example because really just looking for a code review more than anything. The method above should look through the column text to see if those phrases exist, correct? Or am I missing something?
Answers:
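That won’t work because you did not group the alternatives: without a group, the leading \b applies only to the first phrase and the trailing \b only to the last one.
To do this right, you could also sort the phrases by length in descending order, but here, in contains, it is not important: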
df['text'].str.contains(r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True))), case=False, regex=True)
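To see why the grouping matters, here is a minimal sketch that uses Python's re module directly. The sample sentence is made up for illustration; the phrase list is the one from the question.

```python
import re

phrase = ['went to the store to buy an apple', 'corner of the street', 'fbi most wanted']

# Ungrouped: the regex is parsed as  \bA | B | C\b,
# so the middle phrase is matched with no boundary check at all.
ungrouped = re.compile(r'\b{}\b'.format('|'.join(phrase)), re.IGNORECASE)

# Grouped: the regex is parsed as  \b(?:A|B|C)\b,
# so both boundaries apply to every phrase.
grouped = re.compile(r'\b(?:{})\b'.format('|'.join(phrase)), re.IGNORECASE)

# Hypothetical text: 'corner of the street' occurs only inside longer words.
text = 'He stood at the xcorner of the streets.'

print(bool(ungrouped.search(text)))  # True  (partial match slips through)
print(bool(grouped.search(text)))    # False (whole-phrase match required)
```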
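I also recommend case=False instead of re.IGNORECASE: the second positional parameter of Series.str.contains is case, not flags, so a flag passed in that position never reaches the regex engine.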
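As a rough illustration, the two usual ways to get case-insensitive matching look like this. The DataFrame contents are hypothetical, and dtype=object is an assumption made here so that matching goes through Python's re engine.

```python
import re
import pandas as pd

# Hypothetical sample data; dtype=object keeps matching on Python's re engine.
df = pd.DataFrame({'text': ['FBI Most Wanted list', 'nothing relevant here']}, dtype=object)
pattern = r'\b(?:fbi most wanted)\b'

# Option 1: let pandas handle case-insensitivity.
mask_case = df['text'].str.contains(pattern, case=False, regex=True)

# Option 2: pass the regex flag by keyword so it reaches the `flags` parameter.
mask_flags = df['text'].str.contains(pattern, flags=re.IGNORECASE, regex=True)

print(mask_case.tolist())   # [True, False]
print(mask_flags.tolist())  # [True, False]
```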
A foolproof version:
df['text'].str.contains(r'(?!\B\w)(?:{})(?!\B\w)'.format('|'.join(sorted(map(re.escape, phrase), key=len, reverse=True))), case=False, regex=True)
where
phrases are escaped for use in a regex (re.escape)
phrases are sorted by length, longest first
case=False ensures case-insensitive matching
(?!\B\w) defines adaptive word boundaries and ensures a correct whole-word match
(?:...) is a non-capturing group that groups patterns without capturing them (and causes no warnings in Series.str.contains).
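Putting the pieces together, a minimal end-to-end sketch might look like the following. The helper name build_phrase_pattern, the sample rows, and the extra 'C++' phrase are all made up for illustration. dtype=object is an assumption so that Python's re engine is used; other string backends may not support lookarounds.

```python
import re
import pandas as pd

def build_phrase_pattern(phrases):
    """Hypothetical helper: escape, sort longest-first, wrap in adaptive boundaries."""
    escaped = sorted(map(re.escape, phrases), key=len, reverse=True)
    return r'(?!\B\w)(?:{})(?!\B\w)'.format('|'.join(escaped))

# 'C++' is added to show a phrase containing regex metacharacters.
phrase = ['corner of the street', 'fbi most wanted', 'C++']

df = pd.DataFrame({'text': [
    'Seen at the CORNER OF THE STREET.',   # whole phrase, different case
    'Seen at the corner of the streets.',  # phrase only inside a longer word
    'He writes C++ daily.',                # phrase ending in non-word characters
    'Nothing to see here.',
]}, dtype=object)

mask = df['text'].str.contains(build_phrase_pattern(phrase), case=False, regex=True)
subset = df[mask]  # rows that contain at least one whole phrase

print(mask.tolist())  # [True, False, True, False]
```

A plain \b after 'C++' would fail here because there is no word boundary between '+' and a space, which is the case the adaptive boundary is meant to handle.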