Comparing two data frames with different columns and getting the differences

Question

I have a similar question as here Comparing two data frames and getting the differences
But columns in df1 is a subset of columns in df2.

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

df2:
Date       Fruit  Num  Color  A
2013-11-24 Banana 22.1 Yellow 1 
2013-11-24 Orange  8.6 Orange 2
2013-11-24 Apple   7.6 Green  3 
2013-11-24 Celery 10.2 Green  4
2013-11-25 Apple  22.1 Red    5
2013-11-25 Orange  8.6 Orange 6

I would like to get the difference the two df by comparing those columns in common only.
So the result I expect to get is

         Date   Fruit   Num   Color A
4  2013-11-25   Apple  22.1     Red 5
5  2013-11-25  Orange   8.6  Orange 6

Is there a way to do so? Any help is appreciated.

Asked By: test tes

||

Source

Answer 1

First you get the column names of df1

df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]

Now you create a new df2 dataframe with only df1 columns

df2_filtered = df2[df1_columns]

And now you can apply the solution from this other question.

#concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)

#group by
df_gpby = df.groupby(list(df.columns))

# get index of unique records
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

#filter
df.reindex(idx)

Hope it helps!

Answered By: metc

Answer 2

If you don’t have duplicates in df1 and df2[df1.columns] you could try to use .drop_duplicates with keep=False:

res = pd.concat([df1, df2[df1.columns]]).drop_duplicates(keep=False)

Result for your sample dataframes:

         Date   Fruit   Num   Color
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange

PS: As far as I can see the other answer also covers only the non-duplicate case.

Answered By: Timus

Comparing two data frames with different columns and getting the differences

Question:

Answers: