Split and match names in a df with another df and return the matching firms

Question:

Please help me with this particular scenario. I’ve been able to partially do this but the final dataframe does not look correct for all the rows.

I have two Dataframes:

df1:

Name
A/B/C
D/E
F/G

df2:

First Name  Last Name  Firm Names
Adam        A          Firm1
Harry       B          Firm1
Andrew      C          Firm1
Mike        A          Firm2
Sheila      D          Firm3
Hash        E          Firm3
Michelle    F          Firm4
Morty       G          Firm4

df1 contains only last names, separated by a slash (/). I want to match the last names in each row of df1 against df2 and, when there is a firm name common to all of them (A, B and C, for example), return that firm name for the row. Note that a last name from A/B/C can appear under several firm names in df2. I only want the firm name shared by all three last names in that row.

So my final data frame should look something like this :

Name   Firm Name
A/B/C  Firm1
D/E    Firm3
F/G    Firm4
Asked By: olivia

Answers:

If the last names appear in the same order in both DataFrames, you can use:

out = df1[['Name']].merge((df1['Name'].str.split('/')
                                     .explode()
                                     .rename('Last Name')
                                     .reset_index()
                                     .merge(df2, how='left')
                                     .groupby(['index','Firm Names'])
                                     .agg(Name=('Last Name', '/'.join))
                                     .reset_index(level=1)), how='left')
print (out)
    Name Firm Names
0  A/B/C      Firm1
1    D/E      Firm3
2    F/G      Firm4

A more general solution uses frozensets, which match the names in any order.

First, add the split values and their frozensets to a helper DataFrame built from df1:

df11 = df1.assign(**{'Last Name':df1['Name'].str.split('/'),
                    'sets':lambda x: x['Last Name'].apply(frozenset)})
print (df11)
    Name  Last Name       sets
0  A/B/C  [A, B, C]  (B, A, C)
1    D/E     [D, E]     (D, E)
2    F/G     [F, G]     (F, G)
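
Why frozenset rather than set: keys used in merge and groupby must be hashable, and a frozenset compares equal to any other frozenset with the same members, whatever the order. A minimal sketch in plain Python (the variable names are illustrative and not part of the answer):

```python
# Two spellings of the same group of last names, in different orders.
a = frozenset("A/B/C".split("/"))
b = frozenset("C/A/B".split("/"))

same = a == b              # equality ignores the order of the names
lookup = {a: "Firm1"}      # hashable, so it can be used as a dict key
firm = lookup[b]           # an equal frozenset finds the same entry

# A plain (mutable) set cannot be hashed, so it cannot be a join key.
try:
    hash({"A", "B"})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
```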

Use DataFrame.explode on the column of lists, left join with the second DataFrame using DataFrame.merge, and build a frozenset of last names for each firm name:

df22 =  (df11.explode('Last Name')
             .reset_index()
             .merge(df2, how='left')
             .groupby(['index','Firm Names'])
             .agg(sets=('Last Name', frozenset))
             .reset_index(level=1))
print (df22)
      Firm Names       sets
index                      
0          Firm1  (B, A, C)
0          Firm2        (A)
1          Firm3     (D, E)
2          Firm4     (F, G)

Finally, left join the helper DataFrame df11 with df22 and keep only the needed columns:

out = df11.merge(df22, how='left')[['Name','Firm Names']]
print (out)
    Name Firm Names
0  A/B/C      Firm1
1    D/E      Firm3
2    F/G      Firm4
Answered By: jezrael

import pandas as pd
df1 = pd.DataFrame({"Name":["A/B/C","D/E","F/G"]}).rename(columns= {"Name":"Last Name"})
df2 = pd.DataFrame({'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'Michelle', 'Morty'], 'Last Name': ['A', 'B', 'C', 'A', 'D', 'E', 'F', 'G'], 'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3', 'Firm4', 'Firm4']})

unique_firms = df2["Firm Names"].unique()
splitted_names = [set(n.split("/")) for n in df1["Last Name"]]
firm_dict = {firm:set(df2[df2["Firm Names"] == firm]["Last Name"]) for firm in unique_firms}

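# sorted() makes the joined order deterministic; joining a bare set gives an arbitrary order
data1 = [('/'.join(sorted(v)), k) for k, v in firm_dict.items() if v in splitted_names]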
df3 = pd.DataFrame(data1 , columns=["Name", "Firm Name"])

output:

Name   Firm Name
A/B/C  Firm1
D/E    Firm3
F/G    Firm4
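
The comprehension above loops over the firms, so the result follows the firm order rather than the row order of df1. A possible variant, sketched here under the assumption that df1 keeps its original "Name" column, looks each df1 row up in a dict instead. The name firm_by_members is a hypothetical helper, and rows without an exact match get None:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A/B/C", "D/E", "F/G"]})
df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike",
                   "Sheila", "Hash", "Michelle", "Morty"],
    "Last Name": ["A", "B", "C", "A", "D", "E", "F", "G"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2",
                   "Firm3", "Firm3", "Firm4", "Firm4"],
})

# Hypothetical helper: frozenset of a firm's last names -> firm name.
# If two firms had identical member sets, the later one would win.
firm_by_members = {
    frozenset(names): firm
    for firm, names in df2.groupby("Firm Names")["Last Name"]
}

# Keep df1's rows and original strings; look up each row's set of names.
df3 = df1.copy()
df3["Firm Name"] = [
    firm_by_members.get(frozenset(name.split("/"))) for name in df1["Name"]
]
```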

If you need to keep the first names and all the firms as well use the following code:

data2 = [(firm,'/'.join(list(df2[df2["Firm Names"] == firm]["Last Name"])),'/'.join(list(df2[df2["Firm Names"] == firm]["First Name"]))) for firm in unique_firms]
df4 = pd.DataFrame(data2, columns=["Firm Names", "Last Name", "First Name"])

output:

Firm Names Last Name First Name
Firm1 A/B/C Adam/Harry/Andrew
Firm2 A Mike
Firm3 D/E Sheila/Hash
Firm4 F/G Michelle/Morty
Answered By: Gооd_Mаn

You can use a simple merge with a key as frozenset, no need to explode:

out = df1.merge(df2.groupby(['Firm Names'], as_index=False)
                ['Last Name'].agg(frozenset),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name'
               ).drop(columns='Last Name')

Output:

    Name Firm Names
0  A/B/C      Firm1
1    D/E      Firm3
2    F/G      Firm4
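
The merge above is an inner join, so a df1 row with no matching firm is dropped from the result. A sketch of the same idea with an explicit key column and how="left", which keeps such rows with NaN. The "X/Y" row and the column name "key" are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["A/B/C", "D/E", "X/Y"]})  # "X/Y" matches no firm
df2 = pd.DataFrame({
    "First Name": ["Adam", "Harry", "Andrew", "Mike",
                   "Sheila", "Hash", "Michelle", "Morty"],
    "Last Name": ["A", "B", "C", "A", "D", "E", "F", "G"],
    "Firm Names": ["Firm1", "Firm1", "Firm1", "Firm2",
                   "Firm3", "Firm3", "Firm4", "Firm4"],
})

# One row per firm, keyed by the frozenset of its last names.
firms = (df2.groupby("Firm Names", as_index=False)["Last Name"]
            .agg(frozenset)
            .rename(columns={"Last Name": "key"}))

# Build the same kind of key from df1's slash-separated names.
left = df1.assign(key=df1["Name"].str.split("/").apply(frozenset))

# how="left" keeps every df1 row; unmatched rows get NaN in "Firm Names".
out = left.merge(firms, on="key", how="left").drop(columns="key")
```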

Handling other columns

If you have other columns whose values are constant within each firm (i.e., a given Firm Name has a single Address), just include them in the groupby. If the values differ within a firm, you have to aggregate them. Below is an example of both:

out = df1.merge(df2.groupby(['Firm Names', 'Address'], as_index=False)
                   .agg({'Last Name': frozenset, 'ID': ','.join}),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name'
               ).drop(columns='Last Name')

Example:

    Name Firm Names Address     ID
0  A/B/C      Firm1     ABC  a,b,c
1    D/E      Firm3     GHI    e,f
2    F/G      Firm4     JKL    g,h

Modified df2:

  First Name Last Name Firm Names Address ID
0       Adam         A      Firm1     ABC  a
1      Harry         B      Firm1     ABC  b
2     Andrew         C      Firm1     ABC  c
3       Mike         A      Firm2     DEF  d
4     Sheila         D      Firm3     GHI  e
5       Hash         E      Firm3     GHI  f
6   Michelle         F      Firm4     JKL  g
7      Morty         G      Firm4     JKL  h

Pre-filtering df2 (to ignore last names that do not appear in df1):

valid_names = '/'.join(df1['Name']).split('/')

out = df1.merge(df2[df2['Last Name'].isin(valid_names)]
                   .groupby(['Firm Names', 'Address'], as_index=False)
                   .agg({'Last Name': frozenset}),
                left_on=df1['Name'].str.split('/').apply(frozenset),
                right_on='Last Name', how='left'
               ).drop(columns='Last Name')

Output:

    Name Firm Names Address
0  A/B/C      Firm1      MA
1    D/E      Firm3      PS

Used input:

df1 = pd.DataFrame({'Name': ['A/B/C', 'D/E']})
df2 = pd.DataFrame({'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'ABC'],
                    'Last Name': ['A', 'B', 'C', 'B', 'D', 'E', 'XYZ'], 
                    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3','Firm1'], 
                    'Address':['MA', 'MA', 'MA', 'BO', 'PS', 'PS', 'MA']})
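
To see why the pre-filter matters with this input, compare the member sets of Firm1 before and after filtering. This is only an illustrative sketch; the names unfiltered, filtered and target are not part of the answer:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['A/B/C', 'D/E']})
df2 = pd.DataFrame({
    'First Name': ['Adam', 'Harry', 'Andrew', 'Mike', 'Sheila', 'Hash', 'ABC'],
    'Last Name': ['A', 'B', 'C', 'B', 'D', 'E', 'XYZ'],
    'Firm Names': ['Firm1', 'Firm1', 'Firm1', 'Firm2', 'Firm3', 'Firm3', 'Firm1'],
    'Address': ['MA', 'MA', 'MA', 'BO', 'PS', 'PS', 'MA'],
})

valid_names = '/'.join(df1['Name']).split('/')

# Member sets per firm, before and after dropping names absent from df1.
unfiltered = df2.groupby('Firm Names')['Last Name'].agg(frozenset)
filtered = (df2[df2['Last Name'].isin(valid_names)]
            .groupby('Firm Names')['Last Name'].agg(frozenset))

target = frozenset('A/B/C'.split('/'))
matches_before = unfiltered['Firm1'] == target  # False: the set also holds XYZ
matches_after = filtered['Firm1'] == target     # True: XYZ was filtered out
```

Note that valid_names pools the names from every df1 row, so a firm member who appears only in a different df1 row would still be kept and could prevent an exact match.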
Answered By: mozway