How to compare two dataframes and return a new dataframe with only the records that have changed

Question:

I want to build a Python script that compares two pandas dataframes and creates a new df that I can use to update my SQL table. I create df1 by reading the existing table. I create df2 by reading the new data through an API call. I want to isolate the changed rows and update the SQL table with the new values.

I have attempted to compare through an outer merge, but I need help returning the dataframe with only records with a different value in any field.

Here is my example df1:

[image: example df1]

Here is my example df2:

[image: example df2]

My desired output:

[image: desired output]

This function returns the entire dataframe and isn’t working as expected:

def compare_dataframes(df1, df2, pk_col):
    # Merge the two dataframes on the primary key column
    df_merged = pd.merge(df1, df2, on=pk_col, how='outer', suffixes=('_old', '_new'))

    # Identify the rows that are different between the two dataframes
    df_diff = df_merged[df_merged.isna().any(axis=1)]

    # Drop the columns from the old dataframe and rename the columns from the new dataframe
    df_diff = df_diff.drop(columns=[col for col in df_diff.columns if col.endswith('_old')])
    df_diff = df_diff.rename(columns={col: col.replace('_new', '') for col in df_diff.columns})

    return df_diff
Asked By: Mike Mann


Answers:

One approach could be to concatenate the 2 dataframes and then remove duplicates as shown below:

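frames = {1: df1, 2: df2}  # named "frames" to avoid shadowing the built-in dict
df = pd.concat(frames)
# keep=False drops every row whose values occur more than once;
# assign the result, since drop_duplicates does not modify df in place
df = df.drop_duplicates(keep=False)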

As provided in an answer to a similar question:
https://stackoverflow.com/a/42649293/21442120

from io import StringIO
import pandas as pd

DF1 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,,
2,x,y,z,
3,x,y,z,b
4,x,y,,b""")

DF2 = StringIO("""
id,field1,field2,field3,field4
0,x,y,,b
1,x,,a,
2,x,y,z,
3,x,y,z,b
4,x,y,a,b
""")

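df1 = pd.read_csv(DF1, index_col='id')
df2 = pd.read_csv(DF2, index_col='id')

# STEP 1: stack both frames (the keys 1 and 2 become an outer index level)
# and drop every row whose field values occur more than once
frames = {1: df1, 2: df2}
df = pd.concat(frames)
df3 = df.drop_duplicates(keep=False).reset_index()

# STEP 2: where an id survives from both frames, keep the last row (the one from df2),
# then drop the helper column that recorded the source frame
df4 = df3.drop_duplicates(subset=['id'], keep='last')
df4 = df4.drop('level_0', axis=1)
df4.head()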

This gives the desired output:

   id field1 field2 field3 field4
1   1      x    NaN      a    NaN
2   4      x      y      a      b
Answered By: insanity
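
One caveat with the concat approach above: drop_duplicates compares column values only and ignores the index, so two rows with different ids but identical field values count as duplicates of each other. If that matters for your data, a key-by-key comparison may be safer. The following is a minimal sketch, not part of the original answer. The helper name changed_rows is hypothetical, and it assumes the key column holds unique values in each frame and that both frames share the same columns.

```python
from io import StringIO

import pandas as pd


def changed_rows(old, new, pk_col):
    """Return rows of `new` that are missing from `old` or differ in any column."""
    # Index both frames by the key so rows are compared id-to-id.
    old_idx = old.set_index(pk_col)
    new_idx = new.set_index(pk_col)

    # Align old onto new's keys and columns; keys absent from old become all-NaN rows.
    old_aligned = old_idx.reindex(index=new_idx.index, columns=new_idx.columns)

    # Cells match when they are equal, or when both are missing (NaN == NaN is False).
    same = (old_aligned == new_idx) | (old_aligned.isna() & new_idx.isna())

    # Keep rows with at least one mismatch, plus keys that did not exist in old.
    is_new_key = ~new_idx.index.isin(old_idx.index)
    mask = ~same.all(axis=1) | is_new_key
    return new_idx[mask].reset_index()


# Small usage example (made-up data in the same shape as the answer's frames).
old = pd.read_csv(StringIO("id,field1,field3\n0,x,\n1,x,\n2,x,z\n4,x,\n"))
new = pd.read_csv(StringIO("id,field1,field3\n0,x,\n1,x,a\n2,x,z\n4,x,a\n"))
result = changed_rows(old, new, "id")
print(result)  # the rows with id 1 and id 4
```

The returned frame has the same columns as the new data, so it can be passed to whatever update step you use for the SQL table. Rows that exist only in the old frame (deletions) are not reported by this sketch.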