How to compare two python pandas dataframe and find unmatched column name as status
Question:
I want to compare two pandas dataframe and i have to print row which are not matching along with column anme which is not matching as remark column. my dataframe look like
df1
id first_name last_name salary
1 AAA FFF 1000
2 BBB GGG 1000
3 CCC HHH 1000
4 DDD III 1000
5 EEE JJJ 1000
7 PPP QQQ 5000
df2
id first_name last_name salary
1 AAA FFF 2000
2 BBB GGG 1000
3 CCC HHH 1000
4 OOO III 1000
5 EEE JJJ 1000
6 YYY ZZZ 5000
expected df
id first_name last_name salary remark
1 AAA FFF 1000
1 AAA FFF 2000 salary
4 DDD III 1000
4 OOO III 1000 first_name
6 YYY ZZZ 5000 not present in df1
7 PPP QQQ 5000 not present in df2
EDIT: 01 FEB 2023
source_df
id first_name last_name city salary
1 AAA FFF bbb 1000
2 BBB GGG sts 1000
3 CCC HHH aaa 1000
4 DDD III bbb 1000
5 EEE JJJ sts 1000
7 PPP QQQ aaa 5000
8 lll jjj 5000
target_df
id first_name last_name city salary
1 AAA FFF bbb 2000
2 BBB GGG sts 1000
3 CCC HHH aaa 1000
4 OOO III bbb 1000
5 EEE JJJ tst 1000
6 YYY ZZZ aaa 5000
expected df
id first_name last_name city salary remark
1 AAA FFF bbb 1000 salary
1 AAA FFF bbb 2000 salary
4 DDD III bbb 1000 first_name
4 OOO III bbb 1000 first_name
5 EEE JJJ sts 1000 city
5 EEE JJJ tst 1000 city
6 YYY ZZZ aaa 5000 only in target
7 PPP QQQ aaa 5000 only in source
8 lll jjj 5000 only in source
I tried so much but i did not find expected soulution.
Answers:
I would use:
tmp1 = pd.concat([df1.set_index('id'), df2.set_index('id')],
keys=['df1', 'df2'], axis=1)
tmp2 = tmp1['df1'].ne(tmp1['df2'])
m1 = tmp2.any(axis=1)
m2 = tmp2.all(axis=1)
out = tmp1[m1].stack(0).reset_index(1)
out['remark'] = tmp2[m1].dot(df1.columns.difference(['id'])+', ').str[:-2]
out.loc[m2, 'remark'] = 'Only present in ' + out.pop('level_1')[m2]
# or for "not present in"
# out.loc[m2, 'remark'] = 'Not present in ' + out.pop('level_1')[m2].map({'df1': 'df2', 'df2': 'df1'})
Output:
first_name last_name salary remark
id
1 AAA FFF 1000.0 salary
1 AAA FFF 2000.0 salary
4 DDD III 1000.0 first_name
4 OOO III 1000.0 first_name
7 PPP QQQ 5000.0 Only present in df1
6 YYY ZZZ 5000.0 Only present in df2
I want to compare two pandas dataframe and i have to print row which are not matching along with column anme which is not matching as remark column. my dataframe look like
df1
id first_name last_name salary
1 AAA FFF 1000
2 BBB GGG 1000
3 CCC HHH 1000
4 DDD III 1000
5 EEE JJJ 1000
7 PPP QQQ 5000
df2
id first_name last_name salary
1 AAA FFF 2000
2 BBB GGG 1000
3 CCC HHH 1000
4 OOO III 1000
5 EEE JJJ 1000
6 YYY ZZZ 5000
expected df
id first_name last_name salary remark
1 AAA FFF 1000
1 AAA FFF 2000 salary
4 DDD III 1000
4 OOO III 1000 first_name
6 YYY ZZZ 5000 not present in df1
7 PPP QQQ 5000 not present in df2
EDIT: 01 FEB 2023
source_df
id first_name last_name city salary
1 AAA FFF bbb 1000
2 BBB GGG sts 1000
3 CCC HHH aaa 1000
4 DDD III bbb 1000
5 EEE JJJ sts 1000
7 PPP QQQ aaa 5000
8 lll jjj 5000
target_df
id first_name last_name city salary
1 AAA FFF bbb 2000
2 BBB GGG sts 1000
3 CCC HHH aaa 1000
4 OOO III bbb 1000
5 EEE JJJ tst 1000
6 YYY ZZZ aaa 5000
expected df
id first_name last_name city salary remark
1 AAA FFF bbb 1000 salary
1 AAA FFF bbb 2000 salary
4 DDD III bbb 1000 first_name
4 OOO III bbb 1000 first_name
5 EEE JJJ sts 1000 city
5 EEE JJJ tst 1000 city
6 YYY ZZZ aaa 5000 only in target
7 PPP QQQ aaa 5000 only in source
8 lll jjj 5000 only in source
I tried so much but i did not find expected soulution.
I would use:
tmp1 = pd.concat([df1.set_index('id'), df2.set_index('id')],
keys=['df1', 'df2'], axis=1)
tmp2 = tmp1['df1'].ne(tmp1['df2'])
m1 = tmp2.any(axis=1)
m2 = tmp2.all(axis=1)
out = tmp1[m1].stack(0).reset_index(1)
out['remark'] = tmp2[m1].dot(df1.columns.difference(['id'])+', ').str[:-2]
out.loc[m2, 'remark'] = 'Only present in ' + out.pop('level_1')[m2]
# or for "not present in"
# out.loc[m2, 'remark'] = 'Not present in ' + out.pop('level_1')[m2].map({'df1': 'df2', 'df2': 'df1'})
Output:
first_name last_name salary remark
id
1 AAA FFF 1000.0 salary
1 AAA FFF 2000.0 salary
4 DDD III 1000.0 first_name
4 OOO III 1000.0 first_name
7 PPP QQQ 5000.0 Only present in df1
6 YYY ZZZ 5000.0 Only present in df2