How to remove rows only once from a dataframe that exist in another dataframe
Question:
Sorry if this is a stupid question; I am still learning.
Let’s say I have two dataframes like these:
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3, 4, 2], 'col2': ['a', 'b', 'c', 'd', 'b']})
dataframe2 = pd.DataFrame({'col1': [2, 4], 'col2': ['b', 'd']})
I tried this:
merged = pd.merge(dataframe1, dataframe2, how='outer', indicator=True)
result = merged[merged['_merge'] == 'left_only'][dataframe1.columns]
Output:
   col1 col2
0     1    a
3     3    c
But I want this dataframe as the result (that is, delete each row only as many times as it appears in the other dataframe):
   col1 col2
0     1    a
3     3    c
4     2    b
Could you please help me, or send a link if this is already answered in a different thread? I couldn't find one.
Thank you!
Answers:
You were almost there!
De-duplicate with groupby.cumcount before the merge and follow your logic:
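cols = ['col1', 'col2'] # or list(dataframe1)
# n numbers each repeat of a row: 0 for the first occurrence, 1 for the second, ...
merged = pd.merge(dataframe1.assign(n=dataframe1.groupby(cols).cumcount()),
                  dataframe2.assign(n=dataframe2.groupby(cols).cumcount()),
                  how='outer', indicator=True)
# keep only the rows that found no partner in dataframe2, and drop the helper columns
result = merged[merged['_merge'] == 'left_only'][dataframe1.columns]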
Output:
   col1 col2
0     1    a
2     3    c
4     2    b
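To see why the extra n column makes the difference, it may help to look at the counters on their own. This is a small sketch using the question's data; the names n1 and n2 are just illustrative and do not appear in the code above. The second (2, 'b') row in dataframe1 gets n=1, while dataframe2 only has a (2, 'b') row with n=0, so only the first copy can find a match in the merge.

```python
import pandas as pd

dataframe1 = pd.DataFrame({'col1': [1, 2, 3, 4, 2], 'col2': ['a', 'b', 'c', 'd', 'b']})
dataframe2 = pd.DataFrame({'col1': [2, 4], 'col2': ['b', 'd']})
cols = ['col1', 'col2']

# cumcount numbers the rows inside each group of identical (col1, col2) values,
# in their original order, starting at 0
n1 = dataframe1.groupby(cols).cumcount()
n2 = dataframe2.groupby(cols).cumcount()

print(n1.tolist())  # [0, 0, 0, 0, 1]
print(n2.tolist())  # [0, 0]
```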
Intermediate merged:
   col1 col2  n     _merge
0     1    a  0  left_only
1     2    b  0       both
2     3    c  0  left_only
3     4    d  0       both
4     2    b  1  left_only
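If you need this in more than one place, the same idea could be wrapped in a small function. The sketch below is a variation, not the exact code above: remove_once is a hypothetical helper name, the counter column is called _n on the assumption that no real column has that name, and it uses a left merge instead of an outer one so the rows stay in dataframe1's order and the original index labels can be kept.

```python
import pandas as pd


def remove_once(left, right, cols=None):
    """Drop from `left` one row for each matching row in `right` (hypothetical helper)."""
    cols = list(left.columns) if cols is None else list(cols)
    # Number each repeat of a row: 0 for the first occurrence, 1 for the second, ...
    left_n = left.assign(_n=left.groupby(cols).cumcount())
    right_n = right[cols].assign(_n=right.groupby(cols).cumcount())
    # The keys (cols + _n) are unique in right_n, so the left merge returns
    # exactly one row per row of `left`, in the same order.
    merged = left_n.merge(right_n, on=cols + ['_n'], how='left', indicator=True)
    keep = (merged['_merge'] == 'left_only').to_numpy()
    # Apply the mask to the original frame so its index labels are preserved.
    return left[keep]


dataframe1 = pd.DataFrame({'col1': [1, 2, 3, 4, 2], 'col2': ['a', 'b', 'c', 'd', 'b']})
dataframe2 = pd.DataFrame({'col1': [2, 4], 'col2': ['b', 'd']})

result = remove_once(dataframe1, dataframe2)
print(result)
```

With the question's data this keeps the rows (1, 'a'), (3, 'c') and the second (2, 'b'), under their original index labels 0, 2 and 4.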