Split and match names in a df with another df and return the matching firms
Question:
Please help me with this particular scenario. I’ve been able to partially do this but the final dataframe does not look correct for all the rows.
I have two Dataframes:
df1:
Name |
---|
A/B/C |
D/E |
F/G |
df2:
First Name | Last Name | Firm Names |
---|---|---|
Adam | A | Firm1 |
Harry | B | Firm1 |
Andrew | C | Firm1 |
Mike | A | Firm2 |
Sheila | D | Firm3 |
Hash | E | Firm3 |
Michelle | F | Firm4 |
Morty | G | Firm4 |
df1 contains only last names, separated by a slash (/). I want to match the last names in df1 against df2 and, when all the last names in a row (for example A, B and C) share a common firm name, return that firm name for the row. Note that for A/B/C, the same last name can appear under multiple firm names in df2. I only want the firm name that is common to all three last names in that row.
So my final data frame should look something like this :
Name | Firm Name |
---|---|
A/B/C | Firm1 |
D/E | Firm3 |
F/G | Firm4 |
Answers:
If the last names appear in the same order in both DataFrames, you can use:
out = df1[['Name']].merge((df1['Name'].str.split('/')
                                      .explode()
                                      .rename('Last Name')
                                      .reset_index()
                                      .merge(df2, how='left')
                                      .groupby(['index','Firm Names'])
                                      .agg(Name=('Last Name', '/'.join))
                                      .reset_index(level=1)), how='left')
print (out)
Name Firm Names
0 A/B/C Firm1
1 D/E Firm3
2 F/G Firm4
A more general solution is to use frozensets, which match the names in any order.
First create frozensets from the split values in a helper DataFrame, df11:
df11 = df1.assign(**{'Last Name': df1['Name'].str.split('/'),
                     'sets': lambda x: x['Last Name'].apply(frozenset)})
print (df11)
Name Last Name sets
0 A/B/C [A, B, C] (B, A, C)
1 D/E [D, E] (D, E)
2 F/G [F, G] (F, G)
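To illustrate why frozenset is used here rather than a list or a plain set, here is a minimal sketch in plain Python. The names key1, key2 and lookup are only for illustration and are not part of the answer's code.

```python
# Lists compare element by element, so the order of names matters.
lists_differ = ["A", "B", "C"] != ["C", "A", "B"]

# frozensets compare by membership only, so the order is irrelevant.
key1 = frozenset("A/B/C".split("/"))
key2 = frozenset("C/A/B".split("/"))
same_key = key1 == key2

# Unlike set, frozenset is hashable, so it can be used as a dict key
# (and, for the same reason, as a merge key in pandas).
lookup = {key1: "Firm1"}
firm = lookup[key2]
```

A plain set would give the same order-insensitive comparison, but it is not hashable, so it could not be used as a lookup or join key.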
Use DataFrame.explode on the list column, left join with the second DataFrame using DataFrame.merge, then build a frozenset of the matched last names for each firm name:
df22 = (df11.explode('Last Name')
            .reset_index()
            .merge(df2, how='left')
            .groupby(['index','Firm Names'])
            .agg(sets=('Last Name', frozenset))
            .reset_index(level=1))
print (df22)
Firm Names sets
index
0 Firm1 (B, A, C)
0 Firm2 (A)
1 Firm3 (D, E)
2 Firm4 (F, G)
Finally, left join the helper DataFrame df11 (built from df1) with df22 and select the needed columns:
out = df11.merge(df22, how='left')[['Name','Firm Names']]
print (out)
Name Firm Names
0 A/B/C Firm1
1 D/E Firm3
2 F/G Firm4
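As a quick check of the "any ordering" claim, the three steps above can be run end to end on a df1 whose names are deliberately shuffled. This is only a sketch: the shuffled df1 is my own test input, not the question's data.

```python
import pandas as pd

# Assumption for illustration: same people as in the question,
# but the names in df1 are written in a different order.
df1 = pd.DataFrame({"Name": ["C/A/B", "E/D", "F/G"]})
df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike", "Sheila", "Hash", "Michelle", "Morty"],
    "Last Name": ["A", "B", "C", "A", "D", "E", "F", "G"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2", "Firm3", "Firm3", "Firm4", "Firm4"],
})

# Step 1: split the names and build an order-insensitive key per row.
df11 = df1.assign(**{"Last Name": df1["Name"].str.split("/"),
                     "sets": lambda x: x["Last Name"].apply(frozenset)})

# Step 2: one row per name, join the firms, collect matched names per (row, firm).
df22 = (df11.explode("Last Name")
            .reset_index()
            .merge(df2, how="left")
            .groupby(["index", "Firm Names"])
            .agg(sets=("Last Name", frozenset))
            .reset_index(level=1))

# Step 3: keep only firms whose matched names equal the row's full set.
out = df11.merge(df22, how="left")[["Name", "Firm Names"]]
print(out)
```

The Firm2 candidate for the first row is dropped in step 3 because its set contains only A, which does not equal the full set for that row.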
import pandas as pd
df1 = pd.DataFrame({"Name":["A/B/C","D/E","F/G"]}).rename(columns= {"Name":"Last Name"})
df2 = pd.DataFrame({'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'Michelle', 'Morty'], 'Last Name': ['A', 'B', 'C', 'A', 'D', 'E', 'F', 'G'], 'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3', 'Firm4', 'Firm4']})
unique_firms = df2["Firm Names"].unique()
splitted_names = [set(n.split("/")) for n in df1["Last Name"]]
firm_dict = {firm:set(df2[df2["Firm Names"] == firm]["Last Name"]) for firm in unique_firms}
# sorted() makes the joined name deterministic, since sets have no guaranteed order
data1 = [('/'.join(sorted(v)), k) for k, v in firm_dict.items() if v in splitted_names]
df3 = pd.DataFrame(data1 , columns=["Name", "Firm Name"])
output:
Name | Firm Name |
---|---|
A/B/C | Firm1 |
D/E | Firm3 |
F/G | Firm4 |
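If the result should keep df1's own rows and order (including rows with no match), one possible variation on the same idea is a dictionary lookup keyed by frozenset. This is a sketch under the assumption that no two firms have exactly the same set of last names. firm_by_names and result are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A/B/C", "D/E", "F/G"]})
df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike", "Sheila", "Hash", "Michelle", "Morty"],
    "Last Name": ["A", "B", "C", "A", "D", "E", "F", "G"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2", "Firm3", "Firm3", "Firm4", "Firm4"],
})

# Map each firm's complete set of last names to the firm name.
# Assumption: these sets are unique per firm; otherwise later firms overwrite earlier ones.
firm_by_names = {
    frozenset(names): firm
    for firm, names in df2.groupby("Firm Names")["Last Name"]
}

# Look up every df1 row by its own set of names; unmatched rows get None.
result = df1.assign(**{
    "Firm Name": df1["Name"].map(
        lambda s: firm_by_names.get(frozenset(s.split("/")))
    )
})
print(result)
```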
If you need to keep the first names and all the firms as well, use the following code:
data2 = [(firm,'/'.join(list(df2[df2["Firm Names"] == firm]["Last Name"])),'/'.join(list(df2[df2["Firm Names"] == firm]["First Name"]))) for firm in unique_firms]
df4 = pd.DataFrame(data2, columns=["Firm Names", "Last Name", "First Name"])
output:
Firm Names | Last Name | First Name |
---|---|---|
Firm1 | A/B/C | Adam/Harry/Andrew |
Firm2 | A | Mike |
Firm3 | D/E | Sheila/Hash |
Firm4 | F/G | Michelle/Morty |
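The same per-firm summary can probably be produced more compactly with a groupby, which avoids filtering df2 once per firm. A minimal sketch, where summary is an illustrative name:

```python
import pandas as pd

df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike", "Sheila", "Hash", "Michelle", "Morty"],
    "Last Name": ["A", "B", "C", "A", "D", "E", "F", "G"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2", "Firm3", "Firm3", "Firm4", "Firm4"],
})

# Join the last names and first names of each firm with "/",
# keeping the row order within each firm.
summary = (df2.groupby("Firm Names", as_index=False)
              .agg({"Last Name": "/".join, "First Name": "/".join}))
print(summary)
```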
You can use a simple merge with a frozenset as the key; there is no need to explode:
out = df1.merge(df2.groupby(['Firm Names'], as_index=False)
                   ['Last Name'].agg(frozenset),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name'
               ).drop(columns='Last Name')
Output:
Name Firm Names
0 A/B/C Firm1
1 D/E Firm3
2 F/G Firm4
Handling other columns
If you have other columns whose values are constant within each firm (i.e., a given Firm Name has a single Address), just include them in the groupby. If the values differ within a firm, you have to aggregate them. Below is an example of both:
out = df1.merge(df2.groupby(['Firm Names', 'Address'], as_index=False)
                   .agg({'Last Name': frozenset, 'ID': ','.join}),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name'
               ).drop(columns='Last Name')
Example:
Name Firm Names Address ID
0 A/B/C Firm1 ABC a,b,c
1 D/E Firm3 GHI e,f
2 F/G Firm4 JKL g,h
Modified df2:
First Name Last Name Firm Names Address ID
0 Adam A Firm1 ABC a
1 Harry B Firm1 ABC b
2 Andrew C Firm1 ABC c
3 Mike A Firm2 DEF d
4 Sheila D Firm3 GHI e
5 Hash E Firm3 GHI f
6 Michelle F Firm4 JKL g
7 Morty G Firm4 JKL h
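To decide whether an extra column can simply go into the groupby or has to be aggregated, one option is to count its distinct values per firm. This is a sketch on the modified df2 above. address_counts, id_counts and varying are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike", "Sheila", "Hash", "Michelle", "Morty"],
    "Last Name": ["A", "B", "C", "A", "D", "E", "F", "G"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2", "Firm3", "Firm3", "Firm4", "Firm4"],
    "Address": ["ABC", "ABC", "ABC", "DEF", "GHI", "GHI", "JKL", "JKL"],
    "ID": ["a", "b", "c", "d", "e", "f", "g", "h"],
})

# Number of distinct values of each column within every firm.
address_counts = df2.groupby("Firm Names")["Address"].nunique()
id_counts = df2.groupby("Firm Names")["ID"].nunique()

# Address is constant per firm, so it can be added to the groupby keys.
address_is_constant = bool(address_counts.eq(1).all())

# ID varies within some firms, so it needs an aggregation such as ','.join.
varying = id_counts[id_counts > 1].index.tolist()
print(address_is_constant, varying)
```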
Pre-filtering df2:
valid_names = '/'.join(df1['Name']).split('/')
out = df1.merge(df2[df2['Last Name'].isin(valid_names)]
                   .groupby(['Firm Names', 'Address'], as_index=False)
                   .agg({'Last Name': frozenset}),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name', how='left'
               ).drop(columns='Last Name')
Output:
Name Firm Names Address
0 A/B/C Firm1 MA
1 D/E Firm3 PS
Used input:
df1 = pd.DataFrame({'Name': ['A/B/C', 'D/E']})
df2 = pd.DataFrame({'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'ABC'],
                    'Last Name': ['A', 'B', 'C', 'B', 'D', 'E', 'XYZ'],
                    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3','Firm1'],
                    'Address':['MA', 'MA', 'MA', 'BO', 'PS', 'PS', 'MA']})
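The pre-filtering is needed because an exact frozenset match fails as soon as a firm has people who are not listed in df1 (here XYZ at Firm1). A possible alternative, sketched below on the same input, is a subset test: keep every firm whose set of last names contains all the names of the row. This is a different technique from the merge above, and names_by_firm, firms_containing and matches are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A/B/C", "D/E"]})
df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike", "Sheila", "Hash", "ABC"],
    "Last Name": ["A", "B", "C", "B", "D", "E", "XYZ"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2", "Firm3", "Firm3", "Firm1"],
    "Address": ["MA", "MA", "MA", "BO", "PS", "PS", "MA"],
})

# All last names present at each firm.
names_by_firm = {
    firm: frozenset(names)
    for firm, names in df2.groupby("Firm Names")["Last Name"]
}

def firms_containing(name_string):
    """Return every firm that employs all of the slash-separated last names."""
    wanted = frozenset(name_string.split("/"))
    return [firm for firm, names in names_by_firm.items() if wanted <= names]

# A list per row, because more than one firm may contain all the names.
matches = df1.assign(Firms=df1["Name"].map(firms_containing))
print(matches)
```

This loops over all firms for every row, so for large inputs the merge-based approach with pre-filtering is likely to scale better.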