Match two pandas dataframe depending on multiple conditions

Question

I have two datasets (df_persons and df_database). Both of them have the same structure:

cu_id	sex	eye_colour	favourite_sport	cash_on_account
1	m	blue	soccer	15
2	f	green	tennis	25
3	m	brown	ski	33

(many more rows with various combinations of sex, eye_colour and favourite_sport)

For each row/cu_id in persons, I’m looking for a similar match in database.

There are certain rules to follow:

Loop through every row/cu_id from persons

1: search in database for a row with the same values for sex, eye_colour, favourite_sport
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, proceed with 2, else to 5
2: search in database for a row with the same values for sex and eye_colour (ignore favourite_sport!)
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, proceed with 3, else to 5
3: search in database for a row with the same values for sex only (ignore eye_colour and favourite_sport)
- if there is exactly one row, simply return cu_id
- if there is more than one row, sort the results by cash_on_account and return cu_id from top row
- if there is no matching row, simply sort by cash_on_account and return cu_id from top row
- proceed to 5
5: Every time a cu_id in persons gets a "match", that "match" may not be used again. Proceed to the next row in persons

In other words, we are looking (in another table) for the most similar user for persons.
Every user from the database can only be used once.
The order of comparison is important (only if sex+eye_colour+favourite_sport are matching, it’s a match – otherwise only if sex+eye_colour or even just sex. Matching sex+favourite_sport is NO valid match).

import pandas as pd
 
database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
 
# Create the pandas DataFrames from the lists defined above
database = pd.DataFrame(database, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(persons, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])

I simply can’t wrap my head around this problem without using a for loop and extensive comparisons/filtering (especially because every user from database may be matched only once).

Is there any guidance you could offer for problems like that?

Best regards,
worky

Asked By: workah0lic


Answer 1

Would it work for you to merge these dataframes on your conditions in the given order, checking only the still-unmatched persons in each subsequent merge?
Like this:

database = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14], ['4', 'm', 'blue', 'soccer', 10], ['5', 'm', 'green', 'tennis', 15], ['6', 'f', 'brown', 'ski', 14], ['7', 'm', 'blue', 'soccer', 10], ['8', 'f', 'green', 'tennis', 15], ['9', 'm', 'brown', 'ski', 14], ['10', 'f', 'blue', 'soccer', 10], ['11', 'm', 'green', 'tennis', 15], ['12', 'm', 'brown', 'ski', 14], ['13', 'f', 'blue', 'tennis', 10], ['14', 'm', 'green', 'ski', 15], ['15', 'f', 'green', 'soccer', 14]]
persons = [['1', 'm', 'blue', 'soccer', 10], ['2', 'm', 'green', 'tennis', 15], ['3', 'f', 'brown', 'ski', 14]]
 
# Create the pandas DataFrame
database = pd.DataFrame(database, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])
persons = pd.DataFrame(persons, columns=['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account'])

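# After a merge, persons columns get the suffix _x and database columns get _y.
# Sorting by cash_on_account_y ranks the database candidates for each person.

# Tier 1: sex, eye_colour and favourite_sport all match
m1 = (persons.merge(database, how='inner', on=['sex', 'eye_colour', 'favourite_sport'])
      .sort_values('cash_on_account_y', ascending=False)
      .drop_duplicates(subset='cu_id_x', keep='first')
      .reset_index(drop=True))

# Persons without a tier 1 match
persons2 = persons[~persons['cu_id'].isin(m1.cu_id_x.unique())]

# Tier 2: sex and eye_colour match
m2 = (persons2.merge(database, how='inner', on=['sex', 'eye_colour'])
      .sort_values('cash_on_account_y', ascending=False)
      .drop_duplicates(subset='cu_id_x', keep='first')
      .reset_index(drop=True))

# Persons without a tier 1 or tier 2 match
persons3 = persons2[~persons2['cu_id'].isin(m2.cu_id_x.unique())]

# Tier 3: only sex matches
m3 = (persons3.merge(database, how='inner', on=['sex'])
      .sort_values('cash_on_account_y', ascending=False)
      .drop_duplicates(subset='cu_id_x', keep='first')
      .reset_index(drop=True))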

m1['match_type'] = 'match_type_one'
m2['match_type'] = 'match_type_two'
m3['match_type'] = 'match_type_three'

# cu_id_x is the person, cu_id_y is the matched row from database
cols = ['cu_id_x', 'cu_id_y', 'match_type']

final_df = pd.concat([m1[cols], m2[cols], m3[cols]])
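
Note that the merges above do not stop two persons from ending up with the same database row. If the "use every database user only once" rule is strict, a plain loop is probably the simplest way to enforce it. The following is only a minimal sketch under a few assumptions: "top row" is taken to mean the highest cash_on_account, and the names match_once, TIERS, person_cu_id, match_cu_id and match_level are made up for this example. The demo data is shortened to keep it readable.

```python
import pandas as pd

COLS = ['cu_id', 'sex', 'eye_colour', 'favourite_sport', 'cash_on_account']
# Fallback tiers, most specific first; the empty list means "any remaining row".
TIERS = [['sex', 'eye_colour', 'favourite_sport'], ['sex', 'eye_colour'], ['sex'], []]


def match_once(persons, database):
    """Greedy matching: every database row is handed out at most once."""
    # Sort once so the first candidate is always the one with the most cash
    # (assumption: "top row" means highest cash_on_account).
    available = database.sort_values('cash_on_account', ascending=False, kind='stable')
    rows = []
    for person in persons.itertuples(index=False):
        for level, keys in enumerate(TIERS, start=1):
            # Narrow the remaining rows down by the keys of this tier.
            candidates = available
            for key in keys:
                candidates = candidates[candidates[key] == getattr(person, key)]
            if not candidates.empty:
                best = candidates.index[0]
                rows.append((person.cu_id, available.at[best, 'cu_id'], level))
                # Remove the matched row so it cannot be used again.
                available = available.drop(index=best)
                break
        else:
            # No row left in database at all.
            rows.append((person.cu_id, None, None))
    return pd.DataFrame(rows, columns=['person_cu_id', 'match_cu_id', 'match_level'])


database = pd.DataFrame(
    [['1', 'm', 'blue', 'soccer', 10],
     ['2', 'm', 'green', 'tennis', 15],
     ['3', 'f', 'brown', 'ski', 14]],
    columns=COLS)
persons = pd.DataFrame(
    [['1', 'm', 'blue', 'soccer', 10],
     ['2', 'm', 'blue', 'soccer', 15]],
    columns=COLS)

result = match_once(persons, database)
print(result)
```

Here the second person wants the same exact match as the first one, but that row is already taken, so the loop falls through to the sex-only tier. The result depends on the order of the rows in persons, which is what the rules in the question ask for. It is a row-by-row loop, so it will be slower than the merges on large data.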

Answered By: Onur Guven
