Pandas – merge dataframe to keep all values on left and 'insert' values from right if 'no key on left' else 'update' existing 'key' in left

Question:

I have two dataframes df1 and df2.

import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'id': ['2', '23', '234', '2345'], '2021': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'id': ['23', '2345', '67', '45'], '2022': np.random.randn(4)})

  key    id      2021
0   A     2  1.764052
1   B    23  0.400157
2   C   234  0.978738
3   D  2345  2.240893

  key    id      2022
0   B    23  1.867558
1   D  2345 -0.977278
2   E    67  0.950088
3   F    45 -0.151357

I want the result to have unique keys: if a key already exists in df1, update that row; otherwise insert a new row.
I am not sure whether I should use merge, concat, or join. Can anyone give some insight on this, please?

Note: I have tried a full outer join, but it returns duplicate columns. I edited the input dataframes after posting the question.

Thanks!

Asked By: Poongodi


Answers:

I think you need to set key as the index of both frames and then join them with concat along the columns:

df = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1).reset_index()
print(df)
  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
Answered By: jezrael
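
A possible variation on the concat approach, in case both frames carry the id column as in the edited question: indexing on key alone would leave id in both frames and so produce two id columns. The sketch below indexes on both key and id instead. It assumes each (key, id) pair is unique within a frame, and it hard-codes the values printed in the question rather than drawing random numbers.

```python
import pandas as pd

# Values copied from the frames printed in the question.
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'id': ['2', '23', '234', '2345'],
                    '2021': [1.764052, 0.400157, 0.978738, 2.240893]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'id': ['23', '2345', '67', '45'],
                    '2022': [1.867558, -0.977278, 0.950088, -0.151357]})

# Put both identifying columns in the index so that only the
# year columns are concatenated side by side.
wide = pd.concat(
    [df1.set_index(['key', 'id']), df2.set_index(['key', 'id'])],
    axis=1,
).reset_index()

print(wide)
```

Rows whose (key, id) pair appears in only one frame get NaN in the other frame's year column.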

You can do this with the merge function:

df = df1.merge(df2, on='key', how='outer')

df
  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

EDIT

If you also need to merge on 'id':

df = df1.merge(df2, on=['key','id'], how='outer')

key    id      2021      2022
A       2  1.764052       NaN
B      23  0.400157  1.867558
C     234  0.978738       NaN
D    2345  2.240893 -0.977278
E      67       NaN  0.950088
F      45       NaN -0.151357
Answered By: Andrea Ierardi
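
As a follow-up sketch to the merge approach, two optional parameters of merge may help check the result: indicator=True adds a _merge column saying which frame each row came from, and validate='one_to_one' raises an error if the join keys are duplicated on either side. The frames below reuse the values printed in the question, and the name merged is just an illustrative choice.

```python
import pandas as pd

# Values copied from the frames printed in the question.
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'id': ['2', '23', '234', '2345'],
                    '2021': [1.764052, 0.400157, 0.978738, 2.240893]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'id': ['23', '2345', '67', '45'],
                    '2022': [1.867558, -0.977278, 0.950088, -0.151357]})

# Joining on both shared columns avoids the id_x / id_y duplicates
# that appear when only 'key' is used as the join key.
merged = df1.merge(
    df2,
    on=['key', 'id'],
    how='outer',
    validate='one_to_one',  # fail loudly if keys are not unique
    indicator=True,         # adds '_merge': left_only / right_only / both
)

print(merged)
```

Here the rows for B and D are marked both, A and C left_only, and E and F right_only.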

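Given your description, it looks like you want combine_first. It aligns the two frames on their index and, for each cell, keeps the value from the calling frame (df2 here) when it is not null, falling back to the other frame (df1) otherwise.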

df2.set_index('key').combine_first(df1.set_index('key')).reset_index()

Output:

  key      2021      2022
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357
Answered By: mozway
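
To make the "update" part of combine_first visible, here is a minimal sketch with two hypothetical frames (left and right, not from the question) that share a value column. The frame the method is called on takes priority, so keys present in right overwrite the matching rows from left, and keys found only in right are inserted.

```python
import pandas as pd

# Hypothetical frames sharing a 'value' column, so an update is observable.
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'key': ['B', 'D'], 'value': [20.0, 40.0]})

# The calling frame (right) wins wherever it has a non-null value;
# remaining gaps are filled from the argument (left).
upserted = (
    right.set_index('key')
    .combine_first(left.set_index('key'))
    .reset_index()
)

print(upserted)
```

In this sketch B is updated to 20.0, D is inserted with 40.0, and A and C keep their values from left. Note that a null in right does not overwrite a value in left, since combine_first only fills gaps.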