How to remove rows from a dataframe only once for each matching row in another dataframe

Question:

Sorry if this is a stupid question; I am still learning.

Let’s say I have two dataframes like these:

import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3, 4, 2], 'col2': ['a', 'b', 'c', 'd', 'b']})
dataframe2 = pd.DataFrame({'col1': [2, 4], 'col2': ['b', 'd']})

I tried this:

merged = pd.merge(dataframe1, dataframe2, how='outer', indicator=True)
result = merged[merged['_merge'] == 'left_only'][dataframe1.columns]

Output:

   col1 col2
0     1    a
3     3    c

But I want this dataframe as the result, so that each row is deleted only as many times as it appears in the other dataframe:

   col1 col2
0     1    a
3     3    c
4     2    b

Could you please help me, or point me to a link if this has already been answered in a different thread? I couldn’t find one.

Thank you!

Asked By: Merora


Answers:

You were almost there!

Number the duplicate rows with groupby.cumcount before the merge, so that each copy becomes a distinct key, then follow your logic:

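cols = ['col1', 'col2']  # or list(dataframe1)

# n numbers each repeated row within its own frame (0, 1, ...), so the k-th copy
# in dataframe1 can only match the k-th copy in dataframe2
merged = pd.merge(dataframe1.assign(n=dataframe1.groupby(cols).cumcount()),
                  dataframe2.assign(n=dataframe2.groupby(cols).cumcount()),
                  how='outer', indicator=True)

# keep only the rows that found no partner in dataframe2
result = merged[merged['_merge'] == 'left_only'][dataframe1.columns]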

Output:

   col1 col2
0     1    a
2     3    c
4     2    b
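
To see why this works, it may help to look at the counter on its own. The snippet below is only an illustration using the question's dataframe1; the variable name n is just the helper column from the code above.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3, 4, 2], 'col2': ['a', 'b', 'c', 'd', 'b']})
cols = ['col1', 'col2']

# cumcount numbers the rows inside each group of identical (col1, col2) values:
# the first (2, 'b') gets 0, the second (2, 'b') gets 1, unique rows get 0
n = dataframe1.groupby(cols).cumcount()
print(dataframe1.assign(n=n))
```

Because the second (2, 'b') row carries n=1 and dataframe2 only has a (2, 'b') row with n=0, the second copy has nothing to match against and stays 'left_only'.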

Intermediate merged:

   col1 col2  n     _merge
0     1    a  0  left_only
1     2    b  0       both
2     3    c  0  left_only
3     4    d  0       both
4     2    b  1  left_only
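
If you need this more than once, the same idea could be wrapped in a small function. This is only a sketch under a few assumptions: remove_once and the helper column _n are hypothetical names (not pandas API), the frames are assumed not to already contain a column called _n, and it uses how='left' instead of how='outer' so that the merged rows line up with dataframe1's rows and the original index can be kept.

```python
import pandas as pd

def remove_once(left, right, cols=None):
    """Hypothetical helper: drop from `left` one row per matching row in `right`."""
    cols = list(left.columns) if cols is None else list(cols)
    # Number repeated rows within each frame: 0 for the first copy, 1 for the next, ...
    left_n = left.assign(_n=left.groupby(cols).cumcount())
    right_n = right[cols].assign(_n=right.groupby(cols).cumcount())
    # (cols + _n) is unique in right_n, so the left merge yields exactly one
    # row per row of `left`, in the same order
    merged = left_n.merge(right_n, on=cols + ['_n'], how='left', indicator=True)
    keep = (merged['_merge'] == 'left_only').to_numpy()
    # Boolean mask applied positionally, so the original index of `left` survives
    return left[keep]

dataframe1 = pd.DataFrame({'col1': [1, 2, 3, 4, 2], 'col2': ['a', 'b', 'c', 'd', 'b']})
dataframe2 = pd.DataFrame({'col1': [2, 4], 'col2': ['b', 'd']})
result = remove_once(dataframe1, dataframe2)
print(result)
```

With the question's data this keeps the rows (1, 'a'), (3, 'c') and the second (2, 'b'), labelled with their original dataframe1 index 0, 2 and 4.
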
Answered By: mozway