Match two pandas dataframe depending on multiple conditions
Question:
I have two datasets (df_persons and df_database). Both of them have the same structure:
cu_id
sex
eye_colour
favourite_sport
cash_on_account
1
m
blue
soccer
15
2
f
green
tennis
25
3
m
brown
ski
33
(much more rows with various combinations of sex, eye_colour and favourite_sport)
For each individuals of the rows/cu_ids in persons, I’m looking for a similar match in database.
There are certain rules to follow:
Loop through every row/cu_id from persons
- 1: search in database for a row with the same values for sex, eye_colour, favourite_sport
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, proceed with 2, else to 5
- 2: search in database for a row with the same values for sex and eye_colour (ignore favourite_sport!)
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, proceed with 3, else to 5
- 3: search in database for a row with the same values for sex only (ignore eye_colour and favourite_sport)
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, simly sort by cash_on_account and return cu_id from top row
- proceed to 5
- 5: Everytime a cu_id in persona got a "match", this "match" is not allowed to get used another time. Proceed to next row in persons
In other words, we are looking (in another table) for the most similar user for persons.
Every user from the database can only be used once.
The order of comparison is important (only if sex+eye_colour+favourite_sport are matching, it’s a match – otherwise only if sex+eye_colour or even just sex. Matching sex+favourite_sport is NO valid match).
import pandas as pd
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
# Create the pandas DataFrame
database = pd.DataFrame(db, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(data, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
I simply can’t wrap my head around that problem, without using a for loop and extensive comparisons/filtering (especially because I’m just allowed to match every user from database just once).
Is there any guidance you could offer for problems like that?
Best regards,
worky
Answers:
Would merging these dataframes on your conditions in the given order and only checking non matched ones in the next merge work for you?
Like this:
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
# Create the pandas DataFrame
database = pd.DataFrame(database, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(persons, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
m1 = persons.merge(database, how='inner', on=['sex', 'eye_colour', 'favourite_sport'])
.sort_values('cash_on_account_x', ascending=False)
.drop_duplicates(subset='cu_id_x', keep='first')
.reset_index(drop=True)
persons2 = persons[~persons['cu_id'].isin(m1.cu_id_x.unique())]
m2 = persons2.merge(database, how='inner', on=['sex', 'eye_colour'])
.sort_values('cash_on_account_x', ascending=False)
.drop_duplicates(subset='cu_id_x', keep='first')
.reset_index(drop=True)
persons3 = persons2[~persons2['cu_id'].isin(m2.cu_id_x.unique())]
m3 = persons3.merge(database, how='inner', on=['sex'])
.sort_values('cash_on_account_x', ascending=False)
.drop_duplicates(subset='cu_id_x', keep='first')
.reset_index(drop=True)
m1['match_type'] = 'match_type_one'
m2['match_type'] = 'match_type_two'
m3['match_type'] = 'match_type_three'
cols = ['cu_id_x', 'match_type']
final_df = pd.concat([m1[cols], m2[cols], m3[cols]])
I have two datasets (df_persons and df_database). Both of them have the same structure:
cu_id | sex | eye_colour | favourite_sport | cash_on_account |
---|---|---|---|---|
1 | m | blue | soccer | 15 |
2 | f | green | tennis | 25 |
3 | m | brown | ski | 33 |
(much more rows with various combinations of sex, eye_colour and favourite_sport)
For each individuals of the rows/cu_ids in persons, I’m looking for a similar match in database.
There are certain rules to follow:
Loop through every row/cu_id from persons
- 1: search in database for a row with the same values for sex, eye_colour, favourite_sport
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, proceed with 2, else to 5
- 2: search in database for a row with the same values for sex and eye_colour (ignore favourite_sport!)
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, proceed with 3, else to 5
- 3: search in database for a row with the same values for sex only (ignore eye_colour and favourite_sport)
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, simly sort by cash_on_account and return cu_id from top row
- proceed to 5
- 5: Everytime a cu_id in persona got a "match", this "match" is not allowed to get used another time. Proceed to next row in persons
In other words, we are looking (in another table) for the most similar user for persons.
Every user from the database can only be used once.
The order of comparison is important (only if sex+eye_colour+favourite_sport are matching, it’s a match – otherwise only if sex+eye_colour or even just sex. Matching sex+favourite_sport is NO valid match).
import pandas as pd
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
# Create the pandas DataFrame
database = pd.DataFrame(db, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(data, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
I simply can’t wrap my head around that problem, without using a for loop and extensive comparisons/filtering (especially because I’m just allowed to match every user from database just once).
Is there any guidance you could offer for problems like that?
Best regards,
worky
Would merging these dataframes on your conditions in the given order and only checking non matched ones in the next merge work for you?
Like this:
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
# Create the pandas DataFrame
database = pd.DataFrame(database, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(persons, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
m1 = persons.merge(database, how='inner', on=['sex', 'eye_colour', 'favourite_sport'])
.sort_values('cash_on_account_x', ascending=False)
.drop_duplicates(subset='cu_id_x', keep='first')
.reset_index(drop=True)
persons2 = persons[~persons['cu_id'].isin(m1.cu_id_x.unique())]
m2 = persons2.merge(database, how='inner', on=['sex', 'eye_colour'])
.sort_values('cash_on_account_x', ascending=False)
.drop_duplicates(subset='cu_id_x', keep='first')
.reset_index(drop=True)
persons3 = persons2[~persons2['cu_id'].isin(m2.cu_id_x.unique())]
m3 = persons3.merge(database, how='inner', on=['sex'])
.sort_values('cash_on_account_x', ascending=False)
.drop_duplicates(subset='cu_id_x', keep='first')
.reset_index(drop=True)
m1['match_type'] = 'match_type_one'
m2['match_type'] = 'match_type_two'
m3['match_type'] = 'match_type_three'
cols = ['cu_id_x', 'match_type']
final_df = pd.concat([m1[cols], m2[cols], m3[cols]])