Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if every column has the same duplicate
Question:
This is another extension to my previous questions, Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise and Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if there are 3 or more duplicates
I have the following dataframe (actually it's around 7 million rows):
import numpy as np
import pandas as pd
data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
'x1': ['descx1a', 'descx1b', 'descx1c'],
'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
'x3': [np.nan, np.nan, 24319.4],
'x4': [np.nan, np.nan, 24334.15],
'x5': [np.nan, np.nan, 24040.11],
'x6': [404.29, 75.21, 24220.34],
'x7': [np.nan, np.nan, np.nan],
'v': [np.nan, np.nan, np.nan],
'y': [404.29, 75.33, np.nan],
'ay': [np.nan, np.nan, np.nan],
'by': [np.nan, np.nan, np.nan],
'cy': [np.nan, np.nan, np.nan],
'gy': [np.nan, np.nan, np.nan],
'uap': [404.29, 75.33, np.nan],
'ubp': [404.29, 75.33, np.nan],
'sf': [np.nan, 2.0, np.nan]}
df = pd.DataFrame(data)
If all values in my selection of columns are duplicates, I want to delete the duplicates and keep only one copy, if and only if every item in the selection is a duplicate.
Meaning if my selection has 4 columns, all 4 columns must have the same number for it to be considered a duplicate.
If only 2 or 3 of the selection of 4 have duplicates it does not count.
So in my example above, if my selection is ['x6', 'y', 'uap', 'ubp'], the output should be:
data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
'x1': ['descx1a', 'descx1b', 'descx1c'],
'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
'x3': [np.nan, np.nan, 24319.4],
'x4': [np.nan, np.nan, 24334.15],
'x5': [np.nan, np.nan, 24040.11],
'x6': [404.29, 75.21, 24220.34],
'x7': [np.nan, np.nan, np.nan],
'v': [np.nan, np.nan, np.nan],
'y': [np.nan, 75.33, np.nan],
'ay': [np.nan, np.nan, np.nan],
'by': [np.nan, np.nan, np.nan],
'cy': [np.nan, np.nan, np.nan],
'gy': [np.nan, np.nan, np.nan],
'uap': [np.nan, 75.33, np.nan],
'ubp': [np.nan, 75.33, np.nan],
'sf': [np.nan, 2.0, np.nan]}
The second row should not be touched because one of the columns is different.
How can I achieve this?
Answers:
If you want to match all duplicates you can use:
selection = ['x6', 'y', 'uap', 'ubp']
# compare all values to the first one
m = df[selection].eq(df[selection[0]], axis=0)
# if all are duplicates, mask them except the first
df.loc[m.all(axis=1), selection[1:]] = np.nan
Output:
date x1 x2 x3 x4 x5 x6 x7 v y ay by cy gy uap ubp sf
0 2023-02-22 descx1a ALSFNHF950 NaN NaN NaN 404.29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2023-02-21 descx1b KLUGUIF615 NaN NaN NaN 75.21 NaN NaN 75.33 NaN NaN NaN NaN 75.33 75.33 2.0
2 2023-02-23 descx1c NaN 24319.4 24334.15 24040.11 24220.34 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Intermediates:
m
x6 y uap ubp
0 True True True True # all True = duplicate
1 True False False False
2 True False False False
m.all(axis=1)
0 True
1 False
2 False
dtype: bool
Precision
Note that if you have floating point values, seemingly identical values might not compare equal. In this case it might be safer to compute the mask with:
import numpy as np
m = np.isclose(df[selection], df[[selection[0]]])
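For example, the tolerance-based mask plugs into the same assignment as before (a minimal, self-contained sketch on a trimmed-down frame; `rtol`/`atol` are left at NumPy's defaults):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x6': [404.29, 75.21],
                   'y': [404.29 + 1e-9, 75.33],   # equal to x6 only within tolerance
                   'uap': [404.29, 75.33],
                   'ubp': [404.29, 75.33]})
selection = ['x6', 'y', 'uap', 'ubp']

# compare every selected column to the first one, within floating-point tolerance;
# the double brackets keep a (n, 1) shape so NumPy broadcasts across columns
m = np.isclose(df[selection], df[[selection[0]]])

# m is a plain ndarray here, so use ndarray.all; mask all but the first column
df.loc[m.all(axis=1), selection[1:]] = np.nan
```

Exact comparison with `.eq` would miss the first row here, since `404.29 + 1e-9 != 404.29`.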
You can do:
selection = ['x6', 'y', 'uap', 'ubp']
# A value counts as a duplicate if it equals its left or right neighbour,
# i.e. the column-wise diff is 0 in at least one direction; a row qualifies
# only when that holds for every selected column
m = (df[selection].diff(axis='columns').eq(0) |
df[selection].diff(-1, axis='columns').eq(0)).all(1)
# For the rows flagged by the mask, set every selected column except the first to NaN
df.loc[m, selection[1:]] = np.nan
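Run on just the four relevant columns of the sample data (a trimmed-down reproduction, not the full 7-million-row frame), the diff-based mask flags only the first row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x6': [404.29, 75.21, 24220.34],
                   'y': [404.29, 75.33, np.nan],
                   'uap': [404.29, 75.33, np.nan],
                   'ubp': [404.29, 75.33, np.nan]})
selection = ['x6', 'y', 'uap', 'ubp']

# forward and backward column-wise diffs; a diff of 0 in either direction
# means the value duplicates a neighbour, and .all(axis=1) requires this
# for every selected column in the row
m = (df[selection].diff(axis='columns').eq(0) |
     df[selection].diff(-1, axis='columns').eq(0)).all(axis=1)
# m: [True, False, False] -- only row 0 has all four values equal

df.loc[m, selection[1:]] = np.nan
```

Note that `NaN - NaN` is `NaN`, not 0, so all-NaN stretches (like row 2 here) are never treated as duplicates by this approach.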