Split and match names in a df with another df and return the matching firms


Please help me with this particular scenario. I’ve been able to partially do this but the final dataframe does not look correct for all the rows.

I have two DataFrames.

df1:

Name
A/B/C
D/E
F/G

df2:

First Name  Last Name  Firm Names
Adam        A          Firm1
Harry       B          Firm1
Andrew      C          Firm1
Mike        A          Firm2
Sheila      D          Firm3
Hash        E          Firm3
Michelle    F          Firm4
Morty       G          Firm4

df1 contains only last names, separated by a slash (/). I want to match the last names in each row of df1 against df2 and, when there is a firm name common to all of them (for example, to all of A, B and C), return that firm name for the row. Note that in df2 the same last name can appear under several firms (A is listed under both Firm1 and Firm2). I only want the firm name that is common to all the last names in the row.

So my final data frame should look something like this :

Name   Firm Name
A/B/C  Firm1
D/E    Firm3
F/G    Firm4
Asked By: olivia



If the last names appear in the same order in both DataFrames, you can use:

out = df1[['Name']].merge((df1['Name'].str.split('/')
                                     .explode()
                                     .reset_index(name='Last Name')
                                     .merge(df2, how='left')
                                     .groupby(['index','Firm Names'])
                                     .agg(Name=('Last Name', '/'.join))
                                     .reset_index(level=1)), how='left')
print (out)
    Name Firm Names
0  A/B/C      Firm1
1    D/E      Firm3
2    F/G      Firm4

A more general solution is to use frozensets, which match the names regardless of their order:

First create frozensets from the split values in a helper DataFrame, df11:

df11 = df1.assign(**{'Last Name':df1['Name'].str.split('/'),
                    'sets':lambda x: x['Last Name'].apply(frozenset)})
print (df11)
    Name  Last Name       sets
0  A/B/C  [A, B, C]  (B, A, C)
1    D/E     [D, E]     (D, E)
2    F/G     [F, G]     (F, G)
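
Why a frozenset works as the matching key: equality on a frozenset ignores element order, and unlike a plain set it is hashable, so it can serve as a merge key or dictionary key. A minimal sketch in plain Python (the variable names here are illustrative only and do not come from the answer):

```python
# Two spellings of the same group of last names, in different order
a = frozenset("A/B/C".split("/"))
b = frozenset("C/A/B".split("/"))

# Equality and hashing both ignore the original ordering
same_members = (a == b)
same_hash = (hash(a) == hash(b))

# A frozenset can be used as a dictionary key; a plain set cannot
firm_by_names = {a: "Firm1"}
looked_up = firm_by_names[b]
print(same_members, same_hash, looked_up)
```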

Use DataFrame.explode to turn the lists into rows, left join with the second DataFrame using DataFrame.merge, then build a frozenset of last names for each row index and firm name:

df22 =  (df11.explode('Last Name')
             .reset_index()
             .merge(df2, how='left')
             .groupby(['index','Firm Names'])
             .agg(sets=('Last Name', frozenset))
             .reset_index(level=1))
print (df22)
      Firm Names       sets
0          Firm1  (B, A, C)
0          Firm2        (A)
1          Firm3     (D, E)
2          Firm4     (F, G)

Finally, left join back to the helper DataFrame df11 and keep only the needed columns:

out = df11.merge(df22, how='left')[['Name','Firm Names']]
print (out)
    Name Firm Names
0  A/B/C      Firm1
1    D/E      Firm3
2    F/G      Firm4
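
To check that the frozenset key really is order-independent, here is a compact, self-contained sketch. It assumes the same df2 as in the question, but writes the names in df1 in a different order on purpose. The names `left` and `firm_sets` are illustrative, and the per-firm sets are built directly from df2 rather than through explode, so this is a variation on the steps above rather than the exact same pipeline.

```python
import pandas as pd

# Assumed input: the names in df1 are deliberately shuffled relative to df2
df1 = pd.DataFrame({'Name': ['C/A/B', 'E/D', 'G/F']})
df2 = pd.DataFrame({
    'First Name': ['Adam', 'Harry', 'Andrew', 'Mike',
                   'Sheila', 'Hash', 'Michelle', 'Morty'],
    'Last Name': ['A', 'B', 'C', 'A', 'D', 'E', 'F', 'G'],
    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2',
                   'Firm3', 'Firm3', 'Firm4', 'Firm4'],
})

# Order-independent key for every row of df1
left = df1.assign(sets=df1['Name'].str.split('/').apply(frozenset))

# One frozenset of last names per firm
firm_sets = (df2.groupby('Firm Names')['Last Name']
                .agg(frozenset)
                .rename('sets')
                .reset_index())

# Left join on the frozenset column; unmatched rows would get NaN
out = left.merge(firm_sets, on='sets', how='left')[['Name', 'Firm Names']]
print(out)
```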
Answered By: jezrael

import pandas as pd
df1 = pd.DataFrame({"Name":["A/B/C","D/E","F/G"]}).rename(columns= {"Name":"Last Name"})
df2 = pd.DataFrame({'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'Michelle', 'Morty'], 'Last Name': ['A', 'B', 'C', 'A', 'D', 'E', 'F', 'G'], 'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3', 'Firm4', 'Firm4']})

# Set of last names in each row of df1, and set of last names per firm in df2
unique_firms = df2["Firm Names"].unique()
splitted_names = [set(n.split("/")) for n in df1["Last Name"]]
firm_dict = {firm:set(df2[df2["Firm Names"] == firm]["Last Name"]) for firm in unique_firms}

# Keep the firms whose full set of last names equals a row of df1;
# sorted() makes the joined name deterministic, since sets are unordered
data1 = [('/'.join(sorted(v)),k) for k,v in firm_dict.items() if v in splitted_names]
df3 = pd.DataFrame(data1 , columns=["Name", "Firm Name"])


Name   Firm Name
A/B/C  Firm1
D/E    Firm3
F/G    Firm4

If you need to keep the first names and all the firms as well, use the following code:

data2 = [(firm,
          '/'.join(df2[df2["Firm Names"] == firm]["Last Name"]),
          '/'.join(df2[df2["Firm Names"] == firm]["First Name"]))
         for firm in unique_firms]
df4 = pd.DataFrame(data2, columns=["Firm Names", "Last Name", "First Name"])


Firm Names Last Name First Name
Firm1 A/B/C Adam/Harry/Andrew
Firm2 A Mike
Firm3 D/E Sheila/Hash
Firm4 F/G Michelle/Morty
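
The same per-firm summary can probably be written more compactly with a groupby instead of filtering df2 once per firm. This is a sketch of that alternative, not the code from the answer above; it assumes the df2 from the question and joins the names in their original row order.

```python
import pandas as pd

df2 = pd.DataFrame({
    'First Name': ['Adam', 'Harry', 'Andrew', 'Mike',
                   'Sheila', 'Hash', 'Michelle', 'Morty'],
    'Last Name': ['A', 'B', 'C', 'A', 'D', 'E', 'F', 'G'],
    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2',
                   'Firm3', 'Firm3', 'Firm4', 'Firm4'],
})

# One row per firm; last and first names are joined with '/'
df4 = (df2.groupby('Firm Names', as_index=False)
          .agg({'Last Name': '/'.join, 'First Name': '/'.join}))
print(df4)
```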
Answered By: Gооd_Mаn

You can use a simple merge with a key as frozenset, no need to explode:

out = df1.merge(df2.groupby(['Firm Names'], as_index=False)
                ['Last Name'].agg(frozenset),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name'
               ).drop(columns='Last Name')


    Name Firm Names
0  A/B/C      Firm1
1    D/E      Firm3
2    F/G      Firm4
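
If a merge feels heavy for this, the same frozenset key can also drive a plain dictionary lookup. The sketch below is an alternative technique, not part of the merge shown above; `firm_by_names` is a hypothetical helper name, and it assumes no two firms share exactly the same set of last names (otherwise the later firm silently wins).

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A/B/C', 'D/E', 'F/G']})
df2 = pd.DataFrame({
    'First Name': ['Adam', 'Harry', 'Andrew', 'Mike',
                   'Sheila', 'Hash', 'Michelle', 'Morty'],
    'Last Name': ['A', 'B', 'C', 'A', 'D', 'E', 'F', 'G'],
    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2',
                   'Firm3', 'Firm3', 'Firm4', 'Firm4'],
})

# Map each firm's full set of last names to the firm name
firm_by_names = {frozenset(names): firm
                 for firm, names in df2.groupby('Firm Names')['Last Name']}

# Look every row of df1 up; rows without an exact match get None
keys = df1['Name'].str.split('/')
out = df1.assign(**{'Firm Names': [firm_by_names.get(frozenset(k)) for k in keys]})
print(out)
```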

Handling other columns

If you have other columns whose values are constant within a firm (i.e., a given Firm Name has a single Address), just include them in the groupby. If the values differ within a given Firm Name, you have to aggregate them. Below is an example of both:

out = df1.merge(df2.groupby(['Firm Names', 'Address'], as_index=False)
                   .agg({'Last Name': frozenset, 'ID': ','.join}),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name'
               ).drop(columns='Last Name')


    Name Firm Names Address     ID
0  A/B/C      Firm1     ABC  a,b,c
1    D/E      Firm3     GHI    e,f
2    F/G      Firm4     JKL    g,h

Modified df2:

  First Name Last Name Firm Names Address ID
0       Adam         A      Firm1     ABC  a
1      Harry         B      Firm1     ABC  b
2     Andrew         C      Firm1     ABC  c
3       Mike         A      Firm2     DEF  d
4     Sheila         D      Firm3     GHI  e
5       Hash         E      Firm3     GHI  f
6   Michelle         F      Firm4     JKL  g
7      Morty         G      Firm4     JKL  h
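
Pre-filtering df2

If df2 contains last names that do not occur anywhere in df1, filter them out before grouping so that they do not change the sets:
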
valid_names = '/'.join(df1['Name']).split('/')

out = df1.merge(df2[df2['Last Name'].isin(valid_names)]
                   .groupby(['Firm Names', 'Address'], as_index=False)
                   .agg({'Last Name': frozenset}),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name', how='left'
               ).drop(columns='Last Name')


    Name Firm Names Address
0  A/B/C      Firm1      MA
1    D/E      Firm3      PS

Used input:

df1 = pd.DataFrame({'Name': ['A/B/C', 'D/E']})
df2 = pd.DataFrame({'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'ABC'],
                    'Last Name': ['A', 'B', 'C', 'B', 'D', 'E', 'XYZ'], 
                    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3','Firm1'], 
                    'Address':['MA', 'MA', 'MA', 'BO', 'PS', 'PS', 'MA']})
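
The frozenset approaches match a row only when a firm's set of last names is exactly equal to the row's set. If the requirement is read more loosely, as "any firm that every last name in the row belongs to", a set intersection may be closer to it. The following is a sketch of that different technique under that assumed reading; `firms_by_name` and `common_firms` are hypothetical names. It returns a list per row because more than one firm can be common.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A/B/C', 'D/E']})
df2 = pd.DataFrame({
    'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'ABC'],
    'Last Name': ['A', 'B', 'C', 'B', 'D', 'E', 'XYZ'],
    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3', 'Firm1'],
})

# Firms seen for each last name, e.g. 'B' -> {'Firm1', 'Firm2'}
firms_by_name = {name: set(firms)
                 for name, firms in df2.groupby('Last Name')['Firm Names']}

def common_firms(names):
    # Intersect the firm sets of every last name in the row;
    # a last name missing from df2 contributes an empty set
    sets = [firms_by_name.get(n, set()) for n in names.split('/')]
    return sorted(set.intersection(*sets))

out = df1.assign(**{'Firm Names': df1['Name'].apply(common_firms)})
print(out)
```

With this reading no pre-filtering is needed: the extra XYZ row in Firm1 does not prevent A/B/C from matching Firm1.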
Answered By: mozway