Find Multiple Column Combination in Another Pandas Dataframe with Wildcards
Question:
I am trying to perform a daily reconciliation which will check a master dataset in a pandas dataframe for the existence of various combinations from another dataset (the reconciliation key), which contains wildcards.
Here is an illustrative representation of the issue (not using real information from my rec):
Master Dataset
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | Norris | Firefighter | Female |
Angela | Thompson | Teacher | Female |
Ben | Johnson | Police Officer | Male |
Ben | Peterson | Solicitor | Male |
Charlie | Davies | Nurse | Male |
Debbie | Smith | Lawyer | Female |
Reconciliation Key
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | * | * | * |
Ben | Johnson | * | Male |
Debbie | * | * | Female |
So what I am trying to do here is find in the master dataset:
- Any records where FirstName = Angela (with three wildcards for LastName, Occupation, Gender)
- Any records where FirstName = Ben, LastName = Johnson and Gender = Male (with one wildcard on Occupation)
- Any records where FirstName = Debbie and Gender = Female (with two wildcards for LastName and Occupation)
I cannot work out how to do this task. I would normally create a unique key from the combination of FirstName, LastName, Occupation and Gender, and then merge the two datasets, but the wildcards mean this approach doesn't work.
I also tried a cartesian product (which I could later filter), but the size of the dataset I am working with resulted in a memory issue, so that wasn’t working either.
Ideally, the output of the check would give this table (i.e. showing rows in the master dataset that meet the criteria in the reconciliation key):
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | Norris | Firefighter | Female |
Angela | Thompson | Teacher | Female |
Ben | Johnson | Police Officer | Male |
Debbie | Smith | Lawyer | Female |
Answers:
It sounds like this particular type of task could be better suited to a "database".
Using duckdb as an example as it can also read from dataframes:
# probably better to write out the full query
query = "\nand\n".join(
    f"(df2.{column} = '*' or df1.{column} = df2.{column})"
    for column in df2.columns
)

duckdb.sql(f"""
from df1, df2
select df1.*
where
{query}
""")
(although it may make sense to use duckdb directly, since you mention large amounts of data)
┌───────────┬──────────┬────────────────┬─────────┐
│ FirstName │ LastName │ Occupation │ Gender │
│ varchar │ varchar │ varchar │ varchar │
├───────────┼──────────┼────────────────┼─────────┤
│ Angela │ Norris │ Firefighter │ Female │
│ Angela │ Thompson │ Teacher │ Female │
│ Ben │ Johnson │ Police Officer │ Male │
│ Debbie │ Smith │ Lawyer │ Female │
└───────────┴──────────┴────────────────┴─────────┘
Code example used:

import io

import duckdb
import pandas as pd

df1 = pd.read_csv(io.BytesIO(b"""
FirstName,LastName,Occupation,Gender
Angela,Norris,Firefighter,Female
Angela,Thompson,Teacher,Female
Ben,Johnson,Police Officer,Male
Ben,Peterson,Solicitor,Male
Charlie,Davies,Nurse,Male
Debbie,Smith,Lawyer,Female"""))

df2 = pd.read_csv(io.BytesIO(b"""
FirstName,LastName,Occupation,Gender
Angela,*,*,*
Ben,Johnson,*,Male
Debbie,*,*,Female"""))

query = "\nand\n".join(
    f"(df2.{column} = '*' or df1.{column} = df2.{column})"
    for column in df2.columns
)

duckdb.sql(f"""
from df1, df2
select df1.*
where
{query}
""")
Assuming FirstName is always present in the reconciliation key, this should do the trick:
Setup:
import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
    'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
    'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})

recon_df = pd.DataFrame({
    'FirstName': ['Angela', 'Ben', 'Debbie'],
    'LastName': ['*', 'J', '*'],
    'Occupation': ['*', '*', '*'],
    'Gender': ['*', 'Male', 'Female'],
})
Solution:
temp_df = master_df.merge(
    recon_df,
    how='left',
    on='FirstName',
)

# Columns that clash in the merge get '_x' (master) / '_y' (recon) suffixes.
update_cols = [col for col in temp_df.columns if col[-2:] == '_x']
for col in update_cols:
    y_col = col[:-2] + '_y'
    # Where the key holds a wildcard, overwrite the master value with '*'
    # so the two sides compare equal in the index check below.
    temp_df[col] = np.where(
        temp_df[y_col] == '*',
        temp_df[y_col],
        temp_df[col],
    )

temp_df = temp_df.rename(columns={col: col[:-2] for col in update_cols})
temp_df = temp_df.drop(columns=[col for col in temp_df.columns if col[-2:] == '_y'])

# Keep master rows whose (wildcard-substituted) key tuple appears in the recon key.
keys = list(recon_df.columns.values)
index1 = temp_df.set_index(keys).index
index2 = recon_df.set_index(keys).index
output_df = master_df.loc[index1.isin(index2)]
Output:
FirstName LastName Occupation Gender
0 Angela N Firefighter Female
1 Angela T Teacher Female
2 Ben J Police Male
5 Debbie S Lawyer Female
(Built on solution here)
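The same merge idea can also be written without the suffix renaming and re-indexing, by comparing each column against its key directly. A hedged sketch using the setup above (the `_key` suffix and variable names are illustrative, not from the answer):

```python
import pandas as pd

master_df = pd.DataFrame({
    'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
    'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
    'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})
recon_df = pd.DataFrame({
    'FirstName': ['Angela', 'Ben', 'Debbie'],
    'LastName': ['*', 'J', '*'],
    'Occupation': ['*', '*', '*'],
    'Gender': ['*', 'Male', 'Female'],
})

# Inner-merge on the always-present FirstName; clashing recon columns
# get the '_key' suffix while master columns keep their names.
merged = master_df.merge(recon_df, on='FirstName', suffixes=('', '_key'))

# A row survives if every remaining column is wildcarded or equal in the key.
ok = pd.Series(True, index=merged.index)
for col in ['LastName', 'Occupation', 'Gender']:
    ok &= (merged[col + '_key'] == '*') | (merged[col + '_key'] == merged[col])

# drop_duplicates guards against one master row matching several key rows.
output_df = merged.loc[ok, master_df.columns.tolist()].drop_duplicates()
print(output_df)
```

Like the answer above, this only scales the cartesian blow-up by the number of key rows sharing a FirstName, not by the whole key.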
It seems I was able to resolve this by defining a function that checks, for each master row, whether any reconciliation row matches it: FirstName must match exactly, while LastName, Occupation and Gender each match if the key holds either a wildcard or the same value.
This is my code:
Setup:
import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
    'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
    'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})

recon_df = pd.DataFrame({
    'FirstName': ['Angela', 'Ben', 'Debbie'],
    'LastName': ['*', 'J', '*'],
    'Occupation': ['*', '*', '*'],
    'Gender': ['*', 'Male', 'Female'],
})
Solution:
def matches_wildcard(row):
    # A recon row matches when FirstName is equal and every other
    # column is either a '*' wildcard or equal to the master value.
    recon_row = recon_df.loc[
        (recon_df['FirstName'] == row['FirstName']) &
        ((recon_df['LastName'] == '*') | (recon_df['LastName'] == row['LastName'])) &
        ((recon_df['Occupation'] == '*') | (recon_df['Occupation'] == row['Occupation'])) &
        ((recon_df['Gender'] == '*') | (recon_df['Gender'] == row['Gender']))
    ]
    return not recon_row.empty

matched_df = master_df[master_df.apply(matches_wildcard, axis=1)]
print(matched_df)
Output (note the second Angela row also matches the Angela,*,*,* key):
FirstName | LastName | Occupation | Gender |
---|---|---|---|
Angela | N | Firefighter | Female |
Angela | T | Teacher | Female |
Ben | J | Police | Male |
Debbie | S | Lawyer | Female |
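The row-by-row check above can also be expressed as a single vectorised comparison with NumPy broadcasting. A sketch using the same sample frames; note that it materialises an (n_master, n_recon, n_cols) boolean array, so it shares the cartesian-product memory concern for very large data:

```python
import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
    'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
    'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})
recon_df = pd.DataFrame({
    'FirstName': ['Angela', 'Ben', 'Debbie'],
    'LastName': ['*', 'J', '*'],
    'Occupation': ['*', '*', '*'],
    'Gender': ['*', 'Male', 'Female'],
})

# Broadcast master rows against recon rows: shape (n_master, n_recon, n_cols).
m = master_df.to_numpy()[:, None, :]
r = recon_df.to_numpy()[None, :, :]

# A cell matches when the key holds '*' or the values are equal;
# a master row is kept when some recon row matches on every column.
cell_ok = (r == '*') | (r == m)
row_ok = cell_ok.all(axis=2).any(axis=1)

matched_df = master_df[row_ok]
print(matched_df)
```

Assumes the two frames share the same columns in the same order, as in the reconciliation key here.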