Keep the row for which the values of two columns match by group otherwise keep the first row by group

Question:

I wanted to left join df2 on df1 and then keep the row that matches by group and if there is no matching group then I would like to keep the first row of the group in order to achieve df3 (the desired result). I was hoping you guys could help me with finding the optimal solution.

Here is my code to create the two dataframes and the required result.

import pandas as pd
import numpy as np

market = ['SP', 'SP', 'SP']
underlying = ['TSLA', 'GOOG', 'MSFT']
  
# DF1
df = pd.DataFrame(list(zip(market, underlying)),
               columns=['market', 'underlying'])

market2 = ['SP', 'SP', 'SP', 'SP', 'SP']
underlying2 = [None, 'TSLA', 'GBX', 'GBM', 'GBS']
client2 = [17, 12, 100, 21, 10]

# DF2
df2 = pd.DataFrame(list(zip(market2, underlying2, client2)),
               columns=['market', 'underlying', 'client'])

market3 = ['SP', 'SP', 'SP']
underlying3 = ['TSLA', 'GOOG', 'MSFT']
client3 = [12, 17, 17]

# Desired
df3 = pd.DataFrame(list(zip(market3, underlying3, client3)),
               columns =['market', 'underlying', 'client'])

# This works but feels sub optimal
df3 = pd.merge(df, 
               df2,
               how='left',
               on=['market', 'underlying'])
df3 = pd.merge(df3, 
               df2,
               how='left',
               on=['market'])
df3 = df3.drop_duplicates(['market', 'underlying_x'])
df3['client'] = df3['client_x'].combine_first(df3['client_y'])
df3 = df3.drop(labels=['underlying_y', 'client_x', 'client_y'], axis=1)
df3 = df3.rename(columns={'underlying_x': 'underlying'})

Hope you guys could help, thankyou so much!

Asked By: Guyon Van Rooij

||

Answers:

Store the first value (a groupby might not be necessary if every single one in market is ‘SP’), merge and fill with the first value:

fill_value = df2.groupby('market').client.first()

# if you are interested in filtering for None:
fill_value = df2.set_index('market').loc[lambda df: df.underlying.isna(), 'client']


(df
.merge(
    df2, 
    on = ['market', 'underlying'], 
    how = 'left')
.set_index('market')
.fillna({'client':fill_value}, downcast='infer')
)

       underlying  client
market                   
SP           TSLA      12
SP           GOOG      17
SP           MSFT      17
Answered By: sammywemmy
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.