Comparing two data frames with different columns and getting the differences
Question:
I have a similar question as here Comparing two data frames and getting the differences
But columns in df1 is a subset of columns in df2.
df1:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
Date Fruit Num Color A
2013-11-24 Banana 22.1 Yellow 1
2013-11-24 Orange 8.6 Orange 2
2013-11-24 Apple 7.6 Green 3
2013-11-24 Celery 10.2 Green 4
2013-11-25 Apple 22.1 Red 5
2013-11-25 Orange 8.6 Orange 6
I would like to get the difference the two df by comparing those columns in common only.
So the result I expect to get is
Date Fruit Num Color A
4 2013-11-25 Apple 22.1 Red 5
5 2013-11-25 Orange 8.6 Orange 6
Is there a way to do so? Any help is appreciated.
Answers:
First you get the column names of df1
df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]
Now you create a new df2 dataframe with only df1 columns
df2_filtered = df2[df1_columns]
And now you can apply the solution from this other question.
#concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)
#group by
df_gpby = df.groupby(list(df.columns))
# get index of unique records
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
#filter
df.reindex(idx)
Hope it helps!
If you don’t have duplicates in df1
and df2[df1.columns]
you could try to use .drop_duplicates
with keep=False
:
res = pd.concat([df1, df2[df1.columns]]).drop_duplicates(keep=False)
Result for your sample dataframes:
Date Fruit Num Color
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
PS: As far as I can see the other answer also covers only the non-duplicate case.
I have a similar question as here Comparing two data frames and getting the differences
But columns in df1 is a subset of columns in df2.
df1:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
Date Fruit Num Color A
2013-11-24 Banana 22.1 Yellow 1
2013-11-24 Orange 8.6 Orange 2
2013-11-24 Apple 7.6 Green 3
2013-11-24 Celery 10.2 Green 4
2013-11-25 Apple 22.1 Red 5
2013-11-25 Orange 8.6 Orange 6
I would like to get the difference the two df by comparing those columns in common only.
So the result I expect to get is
Date Fruit Num Color A
4 2013-11-25 Apple 22.1 Red 5
5 2013-11-25 Orange 8.6 Orange 6
Is there a way to do so? Any help is appreciated.
First you get the column names of df1
df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]
Now you create a new df2 dataframe with only df1 columns
df2_filtered = df2[df1_columns]
And now you can apply the solution from this other question.
#concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)
#group by
df_gpby = df.groupby(list(df.columns))
# get index of unique records
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
#filter
df.reindex(idx)
Hope it helps!
If you don’t have duplicates in df1
and df2[df1.columns]
you could try to use .drop_duplicates
with keep=False
:
res = pd.concat([df1, df2[df1.columns]]).drop_duplicates(keep=False)
Result for your sample dataframes:
Date Fruit Num Color
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
PS: As far as I can see the other answer also covers only the non-duplicate case.