How to compare two pandas DataFrames and report the mismatched column names as a status

Question:

I want to compare two pandas DataFrames and print the rows that do not match, along with the name of the mismatched column in a remark column. My DataFrames look like this:

df1:
id first_name last_name salary
 1        AAA       FFF   1000
 2        BBB       GGG   1000
 3        CCC       HHH   1000
 4        DDD       III   1000
 5        EEE       JJJ   1000
 7        PPP       QQQ   5000

df2:
id first_name last_name salary
 1        AAA       FFF   2000
 2        BBB       GGG   1000
 3        CCC       HHH   1000
 4        OOO       III   1000
 5        EEE       JJJ   1000
 6        YYY       ZZZ   5000

expected df:
id first_name last_name salary remark
 1        AAA       FFF   1000
 1        AAA       FFF   2000  salary
 4        DDD       III   1000
 4        OOO       III   1000  first_name
 6        YYY       ZZZ   5000  not present in df1
 7        PPP       QQQ   5000  not present in df2
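
For anyone who wants to reproduce the problem, the two sample frames above could be rebuilt roughly like this. This is only a sketch: the values are copied from the tables, and it assumes `id` and `salary` are integers and the name columns are plain strings.

```python
import pandas as pd

# Sample data copied from the df1 table above (dtypes are an assumption)
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 7],
    "first_name": ["AAA", "BBB", "CCC", "DDD", "EEE", "PPP"],
    "last_name": ["FFF", "GGG", "HHH", "III", "JJJ", "QQQ"],
    "salary": [1000, 1000, 1000, 1000, 1000, 5000],
})

# Sample data copied from the df2 table above
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "first_name": ["AAA", "BBB", "CCC", "OOO", "EEE", "YYY"],
    "last_name": ["FFF", "GGG", "HHH", "III", "JJJ", "ZZZ"],
    "salary": [2000, 1000, 1000, 1000, 1000, 5000],
})
```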

EDIT: 01 FEB 2023

source_df:
id first_name last_name city salary
 1        AAA       FFF  bbb   1000
 2        BBB       GGG  sts   1000
 3        CCC       HHH  aaa   1000
 4        DDD       III  bbb   1000
 5        EEE       JJJ  sts   1000
 7        PPP       QQQ  aaa   5000
 8        lll       jjj        5000

target_df:
id first_name last_name city salary
 1        AAA       FFF  bbb   2000
 2        BBB       GGG  sts   1000
 3        CCC       HHH  aaa   1000
 4        OOO       III  bbb   1000
 5        EEE       JJJ  tst   1000
 6        YYY       ZZZ  aaa   5000

expected df:
id first_name last_name city salary remark
 1        AAA       FFF  bbb   1000 salary
 1        AAA       FFF  bbb   2000 salary
 4        DDD       III  bbb   1000 first_name
 4        OOO       III  bbb   1000 first_name
 5        EEE       JJJ  sts   1000 city
 5        EEE       JJJ  tst   1000 city
 6        YYY       ZZZ  aaa   5000 only in target
 7        PPP       QQQ  aaa   5000 only in source
 8        lll       jjj        5000 only in source
 

I have tried many approaches but could not find a solution that produces the expected output.

Asked By: Dhruv Rajkotiya


Answers:

I would use:

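import pandas as pd

# Align both frames on 'id', side by side under the top-level keys 'df1' and 'df2'
tmp1 = pd.concat([df1.set_index('id'), df2.set_index('id')],
                 keys=['df1', 'df2'], axis=1)

# True where a cell differs (the NaN created by a missing id counts as different)
tmp2 = tmp1['df1'].ne(tmp1['df2'])

m1 = tmp2.any(axis=1)  # at least one column differs
m2 = tmp2.all(axis=1)  # every column differs, taken here to mean the id is missing from one frame

# Keep the differing ids and stack the df1/df2 versions as separate rows
out = tmp1[m1].stack(0).reset_index(1)

# Join the names of the differing columns. tmp2.columns is used because .dot()
# matches labels by position, and df1.columns.difference(['id']) returns them sorted,
# which mislabels columns whenever the frame's columns are not in alphabetical order.
out['remark'] = tmp2[m1].dot(tmp2.columns + ', ').str[:-2]

out.loc[m2, 'remark'] = 'Only present in ' + out.pop('level_1')[m2]
# or for "not present in"
# out.loc[m2, 'remark'] = 'Not present in ' + out.pop('level_1')[m2].map({'df1': 'df2', 'df2': 'df1'})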

Output:

   first_name last_name  salary               remark
id                                                  
1         AAA       FFF  1000.0               salary
1         AAA       FFF  2000.0               salary
4         DDD       III  1000.0           first_name
4         OOO       III  1000.0           first_name
7         PPP       QQQ  5000.0  Only present in df1
6         YYY       ZZZ  5000.0  Only present in df2
Answered By: mozway
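
As a cross-check, the same comparison can also be written as an explicit loop over the ids. This is only a minimal sketch and not the answerer's method: `compare_frames` and its parameters are hypothetical names, and it assumes unique `id` values, identical columns in both frames, and no NaN cells (NaN never compares equal to itself, so it would always be reported as a mismatch).

```python
import pandas as pd


def compare_frames(left, right, key="id", left_name="df1", right_name="df2"):
    """Return the rows that differ between two frames, with a 'remark' column.

    Hypothetical helper: assumes unique `key` values, the same columns in both
    frames, and no NaN cells.
    """
    cols = [c for c in left.columns if c != key]
    # Map each key to a {column: value} dict for quick lookups
    lrows = left.set_index(key)[cols].to_dict("index")
    rrows = right.set_index(key)[cols].to_dict("index")

    records = []
    for k in sorted(set(lrows) | set(rrows)):
        if k not in rrows:
            records.append({key: k, **lrows[k], "remark": f"not present in {right_name}"})
        elif k not in lrows:
            records.append({key: k, **rrows[k], "remark": f"not present in {left_name}"})
        else:
            # Names of the columns whose values differ for this key
            diff = [c for c in cols if lrows[k][c] != rrows[k][c]]
            if diff:
                remark = ", ".join(diff)
                records.append({key: k, **lrows[k], "remark": remark})
                records.append({key: k, **rrows[k], "remark": remark})
    return pd.DataFrame(records, columns=[key, *cols, "remark"])


# Sample frames copied from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 7],
    "first_name": ["AAA", "BBB", "CCC", "DDD", "EEE", "PPP"],
    "last_name": ["FFF", "GGG", "HHH", "III", "JJJ", "QQQ"],
    "salary": [1000, 1000, 1000, 1000, 1000, 5000],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "first_name": ["AAA", "BBB", "CCC", "OOO", "EEE", "YYY"],
    "last_name": ["FFF", "GGG", "HHH", "III", "JJJ", "ZZZ"],
    "salary": [2000, 1000, 1000, 1000, 1000, 5000],
})

out = compare_frames(df1, df2)
print(out.to_string(index=False))
```

The loop is slower than the vectorized approach on large frames, but because no NaN placeholders are introduced, `salary` stays an integer column instead of becoming float. For the edited example, calling `compare_frames(source_df, target_df, left_name="source", right_name="target")` should work the same way, provided the blank `city` is stored as an empty string rather than NaN.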