Diff of two Dataframes
Question:
I need to compare two dataframes of different size row-wise and print out the non-matching rows. Let's take the following two:
from pandas import DataFrame
df1 = DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl'],
    'Quantity': [18, 3, 5]})
df2 = DataFrame({
    'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
    'Quantity': [2, 1, 18, 5]})
What is the most efficient way to go row-wise over df2 and print out the rows that are not in df1? For example:
Buyer Quantity
Carl 2
Mark 1
Important: I do not want the following row, which exists only in df1, to be included in the diff:
Buyer Quantity
Carl 3
I have already tried:
Comparing two dataframes of different length row by row and adding columns for each row with equal value
and Compare two DataFrames and output their differences side-by-side
But these do not match my problem.
Answers:
Merge the two DataFrames with how='outer' and pass indicator=True. The resulting _merge column tells you whether each row is present in both, in the left only, or in the right only; you can then filter the merged DataFrame:
In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']
Out[22]:
Buyer Quantity _merge
3 Carl 2 right_only
4 Mark 1 right_only
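The merge-and-filter steps above can be wrapped into a small reusable function. This is only a sketch: the name rows_only_in_right is made up for illustration, and it assumes both frames share the same columns so that merge joins on all of them.

```python
import pandas as pd


def rows_only_in_right(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return rows of `right` with no exact match in `left`.

    Hypothetical helper: assumes both frames have the same columns,
    so merge() joins on every column.
    """
    merged = left.merge(right, how="outer", indicator=True)
    only_right = merged[merged["_merge"] == "right_only"]
    # Drop the helper column and renumber the rows for presentation.
    return only_right.drop(columns="_merge").reset_index(drop=True)


df1 = pd.DataFrame({"Buyer": ["Carl", "Carl", "Carl"],
                    "Quantity": [18, 3, 5]})
df2 = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl", "Carl"],
                    "Quantity": [2, 1, 18, 5]})

result = rows_only_in_right(df1, df2)
print(result)
```

The row ('Carl', 3) is tagged left_only by the merge, so it is filtered out, which is what the question asks for.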
diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity))
This is the first solution that came to mind. You can then put the diff set back in a DF for presentation.
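As a sketch of the "put the diff set back in a DF" step, assuming the question's two columns: a set is unordered and collapses repeated rows, so the example sorts the tuples to get a stable presentation.

```python
import pandas as pd

df1 = pd.DataFrame({"Buyer": ["Carl", "Carl", "Carl"],
                    "Quantity": [18, 3, 5]})
df2 = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl", "Carl"],
                    "Quantity": [2, 1, 18, 5]})

# Rows of df2, as (Buyer, Quantity) tuples, that never occur in df1.
diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity))

# Sets have no order, so sort before building the DataFrame.
diff_df = pd.DataFrame(sorted(diff), columns=["Buyer", "Quantity"])
print(diff_df)
```

Because sets drop duplicates, a row that appears twice in df2 but once in df1 would not show up here.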
Try the following if you only care about adding the new Buyers to the other df:
df_delta=df2[df2['Buyer'].apply(lambda x: x not in df1['Buyer'].values)]
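The same Buyer-only filter can probably be written without a per-row lambda by using Series.isin. This is a sketch of an equivalent, not a change to what the answer computes:

```python
import pandas as pd

df1 = pd.DataFrame({"Buyer": ["Carl", "Carl", "Carl"],
                    "Quantity": [18, 3, 5]})
df2 = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl", "Carl"],
                    "Quantity": [2, 1, 18, 5]})

# Series.isin tests membership for the whole column at once,
# instead of calling a Python function for every row.
df_delta = df2[~df2["Buyer"].isin(df1["Buyer"])]
print(df_delta)
```

Since only the Buyer column is checked, ('Carl', 2) is not reported: Carl already exists in df1.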
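You may find this the best option:
df2[~df2.isin(df1)].dropna()
Note that DataFrame.isin with a DataFrame argument matches on index and column labels, so this compares the two frames position by position rather than searching df1 for each row of df2.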
@EdChum’s answer is self-explanatory. However, filtering on _merge != 'both' makes more sense: you do not need to care about the order of comparison, and this is what a real diff is supposed to be. For the sake of answering your question:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] != 'both']
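To make the difference from the right_only filter concrete, here is a minimal sketch of the "not both" filter on the question's data. The variable name symmetric_diff is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Buyer": ["Carl", "Carl", "Carl"],
                    "Quantity": [18, 3, 5]})
df2 = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl", "Carl"],
                    "Quantity": [2, 1, 18, 5]})

merged = df1.merge(df2, indicator=True, how="outer")

# Keep every row that is missing from one side or the other.
symmetric_diff = merged[merged["_merge"] != "both"]
print(symmetric_diff)
```

This returns rows from both directions, so ('Carl', 3) appears as left_only alongside the two right_only rows. For the question as asked, which excludes that row, the right_only filter is the one that matches.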
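As of pandas 1.1.0, there is pandas.DataFrame.compare, which reports cell-by-cell differences between two identically labeled DataFrames (same index and columns):
df1.compare(df2)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html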
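A short sketch of what compare does and where it stops. The frames old, new and longer are hypothetical and are not the question's data; the first two have the same index and columns, which compare requires.

```python
import pandas as pd

# Hypothetical frames with identical labels (same index, same columns).
old = pd.DataFrame({"Buyer": ["Carl", "Carl", "Carl"],
                    "Quantity": [18, 3, 5]})
new = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl"],
                    "Quantity": [18, 1, 5]})

# Only cells that differ are kept; columns get a self/other sub-level.
changes = old.compare(new)
print(changes)

# A frame with a different number of rows is not identically labeled,
# so compare() refuses it with a ValueError.
longer = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl", "Carl"],
                       "Quantity": [2, 1, 18, 5]})
try:
    old.compare(longer)
    raised = False
except ValueError:
    raised = True
print(raised)
```

So compare answers "which cells changed at the same position", not "which rows of one frame are absent from the other". For frames of different length, as in the question, the merge approach is the closer fit.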
An important edge case
Consider the following, where the second DataFrame has an additional duplicate entry, ('Carl', 5):
df1 = DataFrame({ 'Buyer': ['Carl', 'Carl', 'Carl'],
'Quantity': [ 18 , 3 , 5 ] })
df2 = DataFrame({ 'Buyer': ['Carl', 'Mark', 'Carl', 'Carl', 'Carl'],
'Quantity': [ 2 , 1 , 18 , 5 , 5 ] })
EdChum’s answer will give you the following:
merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])
Buyer Quantity _merge
4 Carl 2 right_only
5 Mark 1 right_only
As you can see, the solution ignores the additional duplicate value, which, depending on what you are doing, may be something you want to avoid.
Here is a solution that more likely does what you want:
df1['duplicate_counter'] = df1.groupby(list(df1.columns)).cumcount()
df2['duplicate_counter'] = df2.groupby(list(df2.columns)).cumcount()
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']
Buyer Quantity duplicate_counter _merge
3 Carl 2 0 right_only
4 Mark 1 0 right_only
5 Carl 5 1 right_only
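The duplicate counter numbers repeated rows 0, 1, 2, and so on within each DataFrame, so every row becomes unique and the n-th copy of a row in df2 can only match the n-th copy in df1. This means that extra duplicates are not silently absorbed by the merge. After merging, you can drop the duplicate_counter column.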
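The duplicate-counter approach can be sketched as a helper that leaves the input frames untouched and drops the helper columns at the end. The function name diff_keep_duplicates is made up for illustration, and the sketch assumes both frames share the same columns and that neither already has a column called duplicate_counter.

```python
import pandas as pd


def diff_keep_duplicates(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Rows of `right` not matched one-to-one by a row of `left`.

    Hypothetical helper: assumes identical columns in both frames and
    no existing column named 'duplicate_counter'.
    """
    # assign() returns new frames, so the caller's data is not modified.
    left_c = left.assign(
        duplicate_counter=left.groupby(list(left.columns)).cumcount())
    right_c = right.assign(
        duplicate_counter=right.groupby(list(right.columns)).cumcount())

    merged = left_c.merge(right_c, indicator=True, how="outer")
    only_right = merged[merged["_merge"] == "right_only"]
    return (only_right
            .drop(columns=["duplicate_counter", "_merge"])
            .reset_index(drop=True))


df1 = pd.DataFrame({"Buyer": ["Carl", "Carl", "Carl"],
                    "Quantity": [18, 3, 5]})
df2 = pd.DataFrame({"Buyer": ["Carl", "Mark", "Carl", "Carl", "Carl"],
                    "Quantity": [2, 1, 18, 5, 5]})

extra = diff_keep_duplicates(df1, df2)
print(extra)
```

With the edge-case data, the second ('Carl', 5) in df2 has counter 1 and finds no partner in df1, so it is reported together with ('Carl', 2) and ('Mark', 1).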