How to select rows in a dataframe based on a list of customer names?
Question:
I have a first dataframe containing bills. Its CONTENTS column holds customer names
inside an unformatted, non-standardized string, like this:
   NUMBIL  DATE      CONTENTS                           AMOUNT
0  858     01/01/23  Billed to HENRY                    25$
1  863     01/01/23  VIKTOR                             96$
2  870     01/01/23  Regard to ALEX                     13$
3  871     07/01/23  MARK 01*                           96$
4  872     07/01/23  To charge SAMANTHA every Thursday  96$
5  880     08/01/23  VIKTOR LECOMTE                     13$
6  881     08/01/23  ****                               13$
I have a second dataframe consisting of a short list of customer names, like this:
   CUSTOMERS
0  VIKTOR
1  ALEX
2  SAMANTHA
What I would like to do
Based on the customer list, identify the rows of the first dataframe whose CONTENTS column does not contain any customer name.
In our case the resulting dataframe would be:
   NUMBIL  DATE      CONTENTS         AMOUNT
0  858     01/01/23  Billed to HENRY  25$
3  871     07/01/23  MARK 01*         96$
6  881     08/01/23  ****             13$
I have already found a possible solution to my problem, but I think this topic could be useful to the community, and I would like to know how you would handle this.
Dataframe to start with
import pandas as pd
fct = pd.DataFrame({
    'NUMBIL': [858, 863, 870, 871, 872, 880, 881],
    'DATE': ['01/01/23', '01/01/23', '01/01/23', '07/01/23', '07/01/23', '08/01/23', '08/01/23'],
    'CONTENTS': ['Billed to HENRY', 'VIKTOR', 'Regard to ALEX', 'MARK 01*',
                 'To charge SAMANTHA every Thursday', 'VIKTOR LECOMTE', '****'],
    'AMOUNT': ['25$', '96$', '13$', '96$', '96$', '13$', '13$'],
})
cust = pd.DataFrame({'CUSTOMERS': ['VIKTOR', 'ALEX', 'SAMANTHA']})
Answers:
You can craft a regex for str.contains and invert its output (~) for boolean indexing:
import re
pattern = '|'.join(map(re.escape, cust['CUSTOMERS']))
out = fct[~fct['CONTENTS'].str.contains(pattern)]
If you only want to match full words (e.g., so that SAM wouldn't match SAMANTHA), add word boundaries (\b):
out = fct[~fct['CONTENTS'].str.contains(fr'\b(?:{pattern})\b')]
Output:
   NUMBIL  DATE      CONTENTS         AMOUNT
0  858     01/01/23  Billed to HENRY  25$
3  871     07/01/23  MARK 01*         96$
6  881     08/01/23  ****             13$
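To make the effect of the word boundaries concrete, here is a minimal sketch. The `bills` frame and the `names` list are hypothetical sample data, not part of the question, chosen so that the two patterns disagree.

```python
import re
import pandas as pd

# Hypothetical sample data, chosen to show where the two patterns differ.
bills = pd.DataFrame({'CONTENTS': ['To charge SAMANTHA', 'Billed to SAM', 'ALEXANDER 01']})
names = ['SAM', 'ALEX']

pattern = '|'.join(map(re.escape, names))

# Substring match: 'SAM' is found inside 'SAMANTHA', 'ALEX' inside 'ALEXANDER'.
loose = bills['CONTENTS'].str.contains(pattern)

# Whole-word match: \b requires a word boundary on both sides of the name.
strict = bills['CONTENTS'].str.contains(fr'\b(?:{pattern})\b')

print(loose.tolist())   # [True, True, True]
print(strict.tolist())  # [False, True, False]
```

With the question's own data both patterns give the same result, because every listed name appears there as a full word.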
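Set theory
This should be much faster when you have a large number of customers to test against, since each word is looked up in a set instead of being matched against a long regex alternation. Note that it only matches whole, whitespace-separated words: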
s = set(cust['CUSTOMERS'])
fct[fct['CONTENTS'].map(lambda c: s.isdisjoint(c.split()))]
Result
   NUMBIL  DATE      CONTENTS         AMOUNT
0  858     01/01/23  Billed to HENRY  25$
3  871     07/01/23  MARK 01*         96$
6  881     08/01/23  ****             13$
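One thing to watch with the set approach: `str.split()` splits on whitespace only, so a name with punctuation attached (such as `ALEX,`) is not found in the set. A possible variation, sketched here on hypothetical data, tokenizes with `re.findall(r'\w+', ...)` instead. Whether that is appropriate depends on what the real CONTENTS strings look like.

```python
import re
import pandas as pd

# Hypothetical sample data: the second row has a comma glued to the name.
demo = pd.DataFrame({'CONTENTS': ['Billed to HENRY', 'Regard to ALEX, urgent',
                                  'VIKTOR LECOMTE', '****']})
s = {'VIKTOR', 'ALEX', 'SAMANTHA'}

# Whitespace split keeps the comma attached: 'ALEX,' is not in the set.
by_split = demo['CONTENTS'].map(lambda c: s.isdisjoint(c.split()))

# \w+ tokens drop the punctuation, so 'ALEX' is recognised.
by_tokens = demo['CONTENTS'].map(lambda c: s.isdisjoint(re.findall(r'\w+', c)))

print(by_split.tolist())   # [True, True, False, True]
print(by_tokens.tolist())  # [True, False, False, True]
```

`True` here means "no customer name found", which is the mask used to keep the row.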
The solution I found, for illustrative purposes.
It is very similar to Mozway's, but it does not handle the SAM / SAMANTHA problem described above.
import pandas as pd
fct = pd.DataFrame({
    'NUMBIL': [858, 863, 870, 871, 872, 880, 881],
    'DATE': ['01/01/23', '01/01/23', '01/01/23', '07/01/23', '07/01/23', '08/01/23', '08/01/23'],
    'CONTENTS': ['Billed to HENRY', 'VIKTOR', 'Regard to ALEX', 'MARK 01*',
                 'To charge SAMANTHA every Thursday', 'VIKTOR LECOMTE', '****'],
    'AMOUNT': ['25$', '96$', '13$', '96$', '96$', '13$', '13$'],
})
cust = pd.DataFrame({'CUSTOMERS': ['VIKTOR', 'ALEX', 'SAMANTHA']})
m = fct['CONTENTS'].str.contains('|'.join(cust['CUSTOMERS']))
r = fct[~m]
print(r)
   NUMBIL  DATE      CONTENTS         AMOUNT
0  858     01/01/23  Billed to HENRY  25$
3  871     07/01/23  MARK 01*         96$
6  881     08/01/23  ****             13$
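A caveat when joining the names directly: they are passed to the regex engine as-is, so a name containing a metacharacter can match more than intended. The sketch below uses a hypothetical customer name, `J.DOE`, to show the difference that `re.escape` makes. The names in the question contain no such characters, so the result there is unaffected.

```python
import re
import pandas as pd

# Hypothetical data: a customer name containing a regex metacharacter ('.').
bills = pd.DataFrame({'CONTENTS': ['Billed to J.DOE', 'Billed to JADOE']})
names = ['J.DOE']

# Unescaped: '.' matches any character, so 'JADOE' is reported as a match too.
raw = bills['CONTENTS'].str.contains('|'.join(names))

# Escaped: '.' is matched literally.
escaped = bills['CONTENTS'].str.contains('|'.join(map(re.escape, names)))

print(raw.tolist())      # [True, True]
print(escaped.tolist())  # [True, False]
```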