How to remove rows in a Pandas dataframe if the same row exists in another dataframe?


I have two dataframes:

 df1 = row1;row2;row3
 df2 = row4;row5;row6;row2

I want my output dataframe to only contain the rows unique in df1, i.e.:

df_out = row1;row3

How do I get this most efficiently?

This code does what I want, but using 2 for-loops:

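import pandas as pd

a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})

match_ident = []
for i in range(0,len(a)):
    found = False
    for j in range(0,len(b)):
        if a[0][i]==b[0][j]:
            if a[1][i]==b[1][j]:
                found = True
    # keep row i of a only if no identical row was found in b
    match_ident.append(not found)

a = a[match_ident]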
Asked By: RRC



You can use merge with the indicator parameter and an outer join, query to filter the rows, and then drop to remove the helper column.

The DataFrames are joined on all columns, so the on parameter can be omitted.

print (pd.merge(a,b, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
   0   1
0  1  10
2  3  30
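
As a variation on the same idea, here is a minimal sketch that wraps the merge in a helper. It assumes a left join is acceptable instead of the outer join, and the function name rows_only_in_left is hypothetical, not part of pandas. Duplicates are dropped from the right frame first so that a row of the left frame cannot be repeated by the join.

```python
import pandas as pd

def rows_only_in_left(left, right):
    """Hypothetical helper: rows of `left` that have no identical row in `right`."""
    # Left join on all shared columns; _merge records where each row was found.
    merged = left.merge(right.drop_duplicates(), how='left', indicator=True)
    # Keep rows seen only in `left` and return just the original columns.
    return merged.loc[merged['_merge'] == 'left_only', left.columns]

a = pd.DataFrame({0: [1, 2, 3], 1: [10, 20, 30]})
b = pd.DataFrame({0: [0, 1, 2, 3], 1: [0, 1, 20, 3]})

only_in_a = rows_only_in_left(a, b)
print(only_in_a)
```

Note that merge builds a fresh index, so the labels in the result match those of a only when a has a default integer index.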
Answered By: jezrael

You could convert a and b into Indexes, then use the Index.isin method to determine which rows the two share:

import pandas as pd
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})

a_index = a.set_index([0,1]).index
b_index = b.set_index([0,1]).index
mask = ~a_index.isin(b_index)
result = a.loc[mask]
print(result)

   0   1
0  1  10
2  3  30
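
The same approach can be sketched for frames with any number of columns by building the index from every column of the left frame. The function name drop_rows_found_in is hypothetical, and the sketch assumes both frames contain the same column labels.

```python
import pandas as pd

def drop_rows_found_in(left, right):
    """Hypothetical helper: rows of `left` that do not appear in `right`."""
    cols = list(left.columns)
    # Turn each row into an index key built from all of left's columns.
    left_keys = left.set_index(cols).index
    right_keys = right.set_index(cols).index
    # Keep the rows of `left` whose key is absent from `right`.
    return left.loc[~left_keys.isin(right_keys)]

a = pd.DataFrame({0: [1, 2, 3], 1: [10, 20, 30]})
b = pd.DataFrame({0: [0, 1, 2, 3], 1: [0, 1, 20, 3]})

result = drop_rows_found_in(a, b)
print(result)
```

Because the mask is applied to the left frame directly, the result keeps that frame's original index labels.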
Answered By: unutbu