How to identify rows that contain duplicates in multiple columns and specific values in others
Question:
I have a dataframe with many columns and rows. I want to identify rows that are duplicates across several columns, where one of the duplicates contains a specific value in another column and the other duplicate contains a different specific value.
Example
ID FDAT LACT DATE EVENT
1 1/1/2022 1 31/1/2022 FRESH
1 1/1/2022 1 15/2/2022 LUT
1 1/1/2022 1 15/3/2022 BRED
1 1/1/2022 1 15/3/2022 OS
1 1/1/2022 1 15/3/2022 PREG
1 1/1/2022 1 30/3/2022 OS
1 1/1/2022 1 30/3/2022 PREG
I can check for duplicates and a specific value with the following code:
df.loc[(df.duplicated(['ID', 'LACT', 'FDAT', 'DATE'])) & (df['EVENT']=='PREG')]
The problem is that I only want to include rows where one of the duplicates on the same day has the EVENT value "BRED". In the example above this is true on 15 March and false on 30 March, so I would like to delete the "PREG" row on 15 March only.
I am looking for a way to make the statement above conditional on EVENT = 'PREG' in one row while the other EVENT in the duplicate group is 'BRED'.
The result I am looking to achieve is:
ID FDAT LACT DATE EVENT
1 1/1/2022 1 31/1/2022 FRESH
1 1/1/2022 1 15/2/2022 LUT
1 1/1/2022 1 15/3/2022 BRED
1 1/1/2022 1 15/3/2022 OS
1 1/1/2022 1 30/3/2022 OS
1 1/1/2022 1 30/3/2022 PREG
Answers:
In this case you need to group by the columns 'ID', 'LACT', 'FDAT' and 'DATE'
and apply an additional filter on the relevant events:
events = {'BRED', 'PREG'}
# Within each duplicate group, drop the 'PREG' rows only when the group
# contains both 'BRED' and 'PREG'; otherwise keep the group unchanged.
df.groupby(['ID', 'LACT', 'FDAT', 'DATE'], sort=False).apply(
    lambda x: x[x['EVENT'].ne('PREG')] if set(x['EVENT']).issuperset(events)
    else x).reset_index(drop=True)
ID FDAT LACT DATE EVENT
0 1 1/1/2022 1 31/1/2022 FRESH
1 1 1/1/2022 1 15/2/2022 LUT
2 1 1/1/2022 1 15/3/2022 BRED
3 1 1/1/2022 1 15/3/2022 OS
4 1 1/1/2022 1 30/3/2022 OS
5 1 1/1/2022 1 30/3/2022 PREG
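As an alternative sketch, the same filter can be expressed without groupby.apply by using a per-group transform, which is usually faster on large frames and avoids the deprecation around applying over the grouping columns in recent pandas versions. The dataframe below is reconstructed from the example in the question; the variable names (keys, has_bred, result) are my own:

```python
import pandas as pd

# Sample data mirroring the example in the question
df = pd.DataFrame({
    'ID':   [1] * 7,
    'FDAT': ['1/1/2022'] * 7,
    'LACT': [1] * 7,
    'DATE': ['31/1/2022', '15/2/2022', '15/3/2022', '15/3/2022',
             '15/3/2022', '30/3/2022', '30/3/2022'],
    'EVENT': ['FRESH', 'LUT', 'BRED', 'OS', 'PREG', 'OS', 'PREG'],
})

keys = ['ID', 'LACT', 'FDAT', 'DATE']

# For each duplicate group, flag whether any row on that day is 'BRED'
has_bred = df['EVENT'].eq('BRED').groupby([df[k] for k in keys]).transform('any')

# Drop 'PREG' rows only in groups that also contain a 'BRED' row
result = df[~(has_bred & df['EVENT'].eq('PREG'))].reset_index(drop=True)
print(result)
```

This removes the 'PREG' row on 15/3/2022 (its group also contains 'BRED') while keeping the 'PREG' row on 30/3/2022, matching the desired output above.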