Python String not matching when using DataFrame str.contains regex – dollar sign issue?

Question:

In the example code below, I want my resulting mask to be True for any row whose "description" column contains any string from drop_transactions. As far as I can tell, both rows in my DataFrame should come back True, but they don't.

import pandas as pd

drop_transactions = ['CRCARDPMT', 'ONLINE PMT SMART',
                         '$TRANSFER DUMB BANK']
d = pd.DataFrame(
    data={'description':
          ['ONLINE PMT SMART ID94991 Internet Initiated Transaction-',
           '$TRANSFER DUMB BANK ID321 Internet Initiated Transaction-']}) 
drop_mask = d['description'].str.contains('|'.join(drop_transactions))

drop_mask
0     True
1    False  # I want this string to also be True
Name: description, dtype: bool

I suspected the dollar sign was the culprit. Sure enough, if I add a dollar sign to the first pattern and the first description as well, the first row also comes back False:

drop_transactions = ['CRCARDPMT', '$ONLINE PMT SMART',  # Note added dollar
                         '$TRANSFER DUMB BANK']
d = pd.DataFrame(
    data={'description':
          ['$ONLINE PMT SMART ID94991 Internet Initiated Transaction-',  # Note added dollar
           '$TRANSFER DUMB BANK ID321 Internet Initiated Transaction-']})
drop_mask = d['description'].str.contains('|'.join(drop_transactions))
drop_mask 
0    False
1    False
Name: description, dtype: bool

I’m not super well-versed in Regex, but can anyone help me understand what’s going on here? I recognize I could change my match string to not look for the dollar sign, but I’d like to understand why this is happening to be sure I’m not encountering any future bugs.

Asked By: Julian Drago

Answers:

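str.contains treats its pattern as a regular expression by default, and in a regex $ is the end-of-string anchor rather than a literal dollar sign, so a pattern like '$TRANSFER DUMB BANK' asks for text after the end of the string and can never match. You can use re.escape to escape special regex characters such as $: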

import re

drop_mask = d["description"].str.contains(
    "|".join(map(re.escape, drop_transactions))
)

print(drop_mask)

Prints:

0    True
1    True
Name: description, dtype: bool
Answered By: Andrej Kesely
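
To see why the unescaped pattern fails, here is a minimal sketch that uses only the standard-library re module on the same strings as the question. It assumes str.contains behaves like re.search for each value, which should hold for the default object-dtype strings; the names raw_hits and escaped_hits are made up for this illustration.

```python
import re

descriptions = [
    "ONLINE PMT SMART ID94991 Internet Initiated Transaction-",
    "$TRANSFER DUMB BANK ID321 Internet Initiated Transaction-",
]
drop_transactions = ["CRCARDPMT", "ONLINE PMT SMART", "$TRANSFER DUMB BANK"]

# Unescaped: "$" anchors to the end of the string, so "$TRANSFER ..."
# requires characters after the end and can never match.
raw_pattern = "|".join(drop_transactions)
raw_hits = [bool(re.search(raw_pattern, s)) for s in descriptions]

# Escaped: re.escape turns "$" into "\$", which matches a literal dollar sign.
escaped_pattern = "|".join(re.escape(t) for t in drop_transactions)
escaped_hits = [bool(re.search(escaped_pattern, s)) for s in descriptions]

print(raw_hits)      # [True, False]
print(escaped_hits)  # [True, True]
```

The same reasoning covers the second experiment in the question: once '$ONLINE PMT SMART' also starts with an unescaped $, that alternative can never match either, so both rows come back False.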