How to remove rows in a Pandas dataframe if the same row exists in another dataframe?
Question:
I have two dataframes:
df1 = row1;row2;row3
df2 = row4;row5;row6;row2
I want my output dataframe to only contain the rows unique in df1, i.e.:
df_out = row1;row3
How do I get this most efficiently?
This code does what I want, but using 2 for-loops:
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})
match_ident = []
for i in range(0,len(a)):
found=False
for j in range(0,len(b)):
if a[0][i]==b[0][j]:
if a[1][i]==b[1][j]:
found=True
match_ident.append(not(found))
a = a[match_ident]
Answers:
You an use merge
with parameter indicator
and outer join, query
for filtering and then remove helper column with drop
:
DataFrames are joined on all columns, so on
parameter can be omit.
print (pd.merge(a,b, indicator=True, how='outer')
.query('_merge=="left_only"')
.drop('_merge', axis=1))
0 1
0 1 10
2 3 30
You could convert a
and b
into Index
s, then use the Index.isin
method to determine which rows are shared in common:
import pandas as pd
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})
a_index = a.set_index([0,1]).index
b_index = b.set_index([0,1]).index
mask = ~a_index.isin(b_index)
result = a.loc[mask]
print(result)
yields
0 1
0 1 10
2 3 30
I have two dataframes:
df1 = row1;row2;row3
df2 = row4;row5;row6;row2
I want my output dataframe to only contain the rows unique in df1, i.e.:
df_out = row1;row3
How do I get this most efficiently?
This code does what I want, but using 2 for-loops:
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})
match_ident = []
for i in range(0,len(a)):
found=False
for j in range(0,len(b)):
if a[0][i]==b[0][j]:
if a[1][i]==b[1][j]:
found=True
match_ident.append(not(found))
a = a[match_ident]
You an use merge
with parameter indicator
and outer join, query
for filtering and then remove helper column with drop
:
DataFrames are joined on all columns, so on
parameter can be omit.
print (pd.merge(a,b, indicator=True, how='outer')
.query('_merge=="left_only"')
.drop('_merge', axis=1))
0 1
0 1 10
2 3 30
You could convert a
and b
into Index
s, then use the Index.isin
method to determine which rows are shared in common:
import pandas as pd
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})
a_index = a.set_index([0,1]).index
b_index = b.set_index([0,1]).index
mask = ~a_index.isin(b_index)
result = a.loc[mask]
print(result)
yields
0 1
0 1 10
2 3 30