filter a DataFrame using complete word only

Question:

I have a large DataFrame (named: complete) of data (only two columns). I want to filter it using complete words only, instead of substrings. Ex:

complete dataframe:

comment sentiment
fast running 0.9
heavily raining 0.5
in the house 0.1
coming in 0.0
rubbing it -0.5

if I set a substring to filter my table:

substring = 'in'
comp = complete[complete.apply(lambda row: row.astype(str).str.contains(substring, case=False).any(), axis=1)]

output comp:

comment sentiment
fast running 0.9
heavily raining 0.5
in the house 0.1
coming in 0.0
rubbing it -0.5

It returns the same DF since all words do have "in" as a substring.

My desired output:

comment sentiment
in the house 0.1
coming in 0.0

Filter it only if the substring is found as a word, not as a substring.

¿How can this be done?

Asked By: Alexandro Giles

||

Answers:

Use word boundaries (b) in your pattern:

substring = 'in'
comp = complete[complete['comment'].str.contains(fr'b{substring}b', case=False)]

If you have many columns to check, aggregate after (this is much more efficient than apply with axis=1):

comp = complete[complete[list_of_cols].apply(lambda s: s.str.contains(fr'b{substring}b', case=False)).any(axis=1)]
Answered By: mozway

Another possible solution, which uses lookaround and w means word character and !w means no word character. See demonstration to understand the logic.

sub = 'in'
df.loc[df.comment.str.contains(fr'(?<!w){sub}(?!w)'), :]

Output:

        comment  sentiment
2  in the house        0.1
3     coming in        0.0
Answered By: PaulS