Pandas – merge dataframe to keep all values on left and 'insert' values from right if 'no key on left' else 'update' existing 'key' in left
Question:
I have two dataframes df1 and df2.
np.random.seed(0)
df1= pd.DataFrame({'key': ['A', 'B', 'C', 'D'],'id': ['2', '23', '234', '2345'], '2021': np.random.randn(4)})
df2= pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'id': ['23', '2345', '67', '45'],'2022': np.random.randn(4)})
key id 2021
0 A 2 1.764052
1 B 23 0.400157
2 C 234 0.978738
3 D 2345 2.240893
key id 2022
0 B 23 1.867558
1 D 2345 -0.977278
2 E 67 0.950088
3 F 45 -0.151357
I want to have unique keys. If key found already just update the key else insert new row.
I am not sure if I have to use merge/concat/join. Can anyone give insight on this please?
Note:I have used full outer join, it returns duplicate columns. Have edited the input dataframes after posting the question.
Thanks!
Answers:
I think you need create index from key
and then join in concat
:
df = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1).reset_index()
print (df)
key 2021 2022
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
You can do it using merge function:
df = df1.merge(df2, on='key', how='outer')
df
key 2021 2022
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
EDIT
In case you need to merge also for ‘id’:
df = df1.merge(df2, on=['key','id'], how='outer')
key id 2021 2022
A 2 1.764052 NaN
B 23 0.400157 1.867558
C 234 0.978738 NaN
D 2345 2.240893 -0.977278
E 67 NaN 0.950088
F 45 NaN -0.151357
Given your description, it looks like you want combine_first
. It will merge the two datasets by replacing the duplicates in order.
df2.set_index('key'). combine_first(df1.set_index('key')).reset_index()
Output:
key 2021 2022
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
I have two dataframes df1 and df2.
np.random.seed(0)
df1= pd.DataFrame({'key': ['A', 'B', 'C', 'D'],'id': ['2', '23', '234', '2345'], '2021': np.random.randn(4)})
df2= pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'id': ['23', '2345', '67', '45'],'2022': np.random.randn(4)})
key id 2021
0 A 2 1.764052
1 B 23 0.400157
2 C 234 0.978738
3 D 2345 2.240893
key id 2022
0 B 23 1.867558
1 D 2345 -0.977278
2 E 67 0.950088
3 F 45 -0.151357
I want to have unique keys. If key found already just update the key else insert new row.
I am not sure if I have to use merge/concat/join. Can anyone give insight on this please?
Note:I have used full outer join, it returns duplicate columns. Have edited the input dataframes after posting the question.
Thanks!
I think you need create index from key
and then join in concat
:
df = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1).reset_index()
print (df)
key 2021 2022
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
You can do it using merge function:
df = df1.merge(df2, on='key', how='outer')
df
key 2021 2022
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
EDIT
In case you need to merge also for ‘id’:
df = df1.merge(df2, on=['key','id'], how='outer')
key id 2021 2022
A 2 1.764052 NaN
B 23 0.400157 1.867558
C 234 0.978738 NaN
D 2345 2.240893 -0.977278
E 67 NaN 0.950088
F 45 NaN -0.151357
Given your description, it looks like you want combine_first
. It will merge the two datasets by replacing the duplicates in order.
df2.set_index('key'). combine_first(df1.set_index('key')).reset_index()
Output:
key 2021 2022
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357