How to select rows in a dataframe based on a string customers list?

Question:

I have a first dataframe containing bills. Its CONTENTS column holds customer names embedded in unformatted, non-standardized strings, like this:

   NUMBIL      DATE                           CONTENTS AMOUNT
0     858  01/01/23                    Billed to HENRY    25$
1     863  01/01/23                             VIKTOR    96$
2     870  01/01/23                     Regard to ALEX    13$
3     871  07/01/23                           MARK 01*    96$
4     872  07/01/23  To charge SAMANTHA every Thursday    96$
5     880  08/01/23                     VIKTOR LECOMTE    13$
6     881  08/01/23                               ****    13$

I have a second dataframe consisting of a short list of customer names, like this:

   CUSTOMERS
0     VIKTOR
1       ALEX
2   SAMANTHA

What I would like to do

Based on the customer list, identify the rows in the first dataframe whose CONTENTS column does not contain any customer name.

In our case, the resulting dataframe would be:

   NUMBIL      DATE         CONTENTS AMOUNT
0     858  01/01/23  Billed to HENRY    25$
3     871  07/01/23         MARK 01*    96$
6     881  08/01/23             ****    13$

I have already found a possible solution to my problem, but I think this topic could be useful to the community, and I would like to know how you would handle it.

Dataframe to start with

import pandas as pd

fct = pd.DataFrame({'NUMBIL':[858, 863, 870, 871, 872, 880, 881],
                   'DATE':['01/01/23', '01/01/23', '01/01/23', '07/01/23', '07/01/23', '08/01/23', '08/01/23'],
                   'CONTENTS':['Billed to HENRY', 'VIKTOR', 'Regard to ALEX', 'MARK 01*', 
                               'To charge SAMANTHA every Thursday', 'VIKTOR LECOMTE', '****'],
                   'AMOUNT':['25$', '96$', '13$', '96$', '96$', '13$', '13$'],
                   })

cust = pd.DataFrame({'CUSTOMERS':['VIKTOR', 'ALEX', 'SAMANTHA'],
                   })
Asked By: Laurent B.

||

Answers:

You can craft a regex for str.contains and invert its output (~) for boolean indexing:

import re

pattern = '|'.join(map(re.escape, cust['CUSTOMERS']))

out = fct[~fct['CONTENTS'].str.contains(pattern)]

If you only want to match full words (e.g., SAM wouldn’t match SAMANTHA), add word boundaries (\b):

out = fct[~fct['CONTENTS'].str.contains(fr'\b(?:{pattern})\b')]

Output:

   NUMBIL      DATE         CONTENTS AMOUNT
0     858  01/01/23  Billed to HENRY    25$
3     871  07/01/23         MARK 01*    96$
6     881  08/01/23             ****    13$
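
To make the difference between the two patterns concrete, here is a minimal sketch on made-up data (the `bills` frame and the `names` list are hypothetical, not part of the question), where SAM is a customer but SAMANTHA is not:

```python
import re
import pandas as pd

# Hypothetical data for illustration only.
bills = pd.DataFrame({'CONTENTS': ['To charge SAMANTHA',
                                   'Billed to SAM',
                                   'VIKTOR LECOMTE']})
names = ['SAM']

pattern = '|'.join(map(re.escape, names))

# Substring match: 'SAM' is also found inside 'SAMANTHA',
# so the SAMANTHA row is treated as a known customer and dropped.
loose = bills[~bills['CONTENTS'].str.contains(pattern)]

# Whole-word match: only the standalone word 'SAM' counts,
# so the SAMANTHA row is kept as "no known customer".
strict = bills[~bills['CONTENTS'].str.contains(fr'\b(?:{pattern})\b')]

print(loose)
print(strict)
```

With the substring pattern only the VIKTOR LECOMTE row remains, while the word-boundary pattern also keeps the SAMANTHA row.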
Answered By: mozway

Set theory

This would be much faster when you have a large number of customers to test against. Note that it only matches whole, whitespace-separated words.

s = set(cust['CUSTOMERS'])
fct[fct['CONTENTS'].map(lambda c: s.isdisjoint(c.split()))]

Result

   NUMBIL      DATE         CONTENTS AMOUNT
0     858  01/01/23  Billed to HENRY    25$
3     871  07/01/23         MARK 01*    96$
6     881  08/01/23             ****    13$
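
One caveat worth illustrating: because `split()` cuts only on whitespace, a name followed by punctuation (such as `ALEX,`) is not found in the set. The sketch below uses made-up data and swaps in `re.findall(r'\w+', ...)` as one possible tokenizer; this variation is an assumption, not part of the answer above:

```python
import re
import pandas as pd

# Hypothetical data for illustration only.
bills = pd.DataFrame({'CONTENTS': ['Regard to ALEX, thanks',
                                   'Billed to HENRY',
                                   'VIKTOR LECOMTE']})
customers = {'VIKTOR', 'ALEX'}

# Whitespace split: the token is 'ALEX,' (with the comma),
# so the first row is wrongly reported as having no customer.
by_split = bills[bills['CONTENTS'].map(
    lambda c: customers.isdisjoint(c.split()))]

# Word tokens: r'\w+' drops the punctuation, so 'ALEX' is found.
by_tokens = bills[bills['CONTENTS'].map(
    lambda c: customers.isdisjoint(re.findall(r'\w+', c)))]

print(by_split)
print(by_tokens)
```

Both variants remain case-sensitive, so the customer names and the bill text need to use the same casing.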
Answered By: Shubham Sharma

The solution I found, for illustrative purposes.

Very similar to mozway's, but it does not address the SAM / SAMANTHA problem mentioned above.

import pandas as pd

fct = pd.DataFrame({'NUMBIL':[858, 863, 870, 871, 872, 880, 881],
                   'DATE':['01/01/23', '01/01/23', '01/01/23', '07/01/23', '07/01/23', '08/01/23', '08/01/23'],
                   'CONTENTS':['Billed to HENRY', 'VIKTOR', 'Regard to ALEX', 'MARK 01*', 
                               'To charge SAMANTHA every Thursday', 'VIKTOR LECOMTE', '****'],
                   'AMOUNT':['25$', '96$', '13$', '96$', '96$', '13$', '13$'],
                   })

cust = pd.DataFrame({'CUSTOMERS':['VIKTOR', 'ALEX', 'SAMANTHA'],
                   })

m = fct['CONTENTS'].str.contains('|'.join(cust['CUSTOMERS']))
r = fct[~m]

print(r)
   NUMBIL      DATE         CONTENTS AMOUNT
0     858  01/01/23  Billed to HENRY    25$
3     871  07/01/23         MARK 01*    96$
6     881  08/01/23             ****    13$
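
Joining the names without escaping works for the sample data, but each name is then interpreted as a regular expression. A minimal sketch with a made-up customer name containing a `+` (the `bills` frame and `names` list are hypothetical) shows where the unescaped join can go wrong and how `re.escape` avoids it:

```python
import re
import pandas as pd

# Hypothetical data for illustration only.
bills = pd.DataFrame({'CONTENTS': ['Billed to A+B CORP',
                                   'Billed to HENRY']})
names = ['A+B CORP']

# Unescaped: 'A+' means "one or more A", so the literal text
# 'A+B CORP' is not matched and the first row is wrongly kept.
raw = bills[~bills['CONTENTS'].str.contains('|'.join(names))]

# Escaped: every name is matched as literal text.
safe_pattern = '|'.join(map(re.escape, names))
escaped = bills[~bills['CONTENTS'].str.contains(safe_pattern)]

print(raw)
print(escaped)
```

Here the unescaped version keeps both rows, whereas the escaped version keeps only the HENRY row.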
Answered By: Laurent B.