Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if every column has the same duplicate
Question:
This is another extension to my previous questions, Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise and Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if there are 3 or more duplicates
I have the following dataframe (actually it's around 7 million rows):
import numpy as np
import pandas as pd
data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
'x1': ['descx1a', 'descx1b', 'descx1c'],
'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
'x3': [np.nan, np.nan, 24319.4],
'x4': [np.nan, np.nan, 24334.15],
'x5': [np.nan, np.nan, 24040.11],
'x6': [404.29, 75.21, 24220.34],
'x7': [np.nan, np.nan, np.nan],
'v': [np.nan, np.nan, np.nan],
'y': [404.29, 75.33, np.nan],
'ay': [np.nan, np.nan, np.nan],
'by': [np.nan, np.nan, np.nan],
'cy': [np.nan, np.nan, np.nan],
'gy': [np.nan, np.nan, np.nan],
'uap': [404.29, 75.33, np.nan],
'ubp': [404.29, 75.33, np.nan],
'sf': [np.nan, 2.0, np.nan]}
df = pd.DataFrame(data)
If all values in my selection of columns are duplicates, I want to delete the duplicates and keep only one copy, if and only if every item in the selection is a duplicate.
Meaning if my selection has 4 columns, all 4 columns must have the same number for it to be considered a duplicate.
If only 2 or 3 of the selection of 4 have duplicates it does not count.
So in my example above, if my selection is ['x6', 'y', 'uap', 'ubp'], the output should be:
data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
'x1': ['descx1a', 'descx1b', 'descx1c'],
'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
'x3': [np.nan, np.nan, 24319.4],
'x4': [np.nan, np.nan, 24334.15],
'x5': [np.nan, np.nan, 24040.11],
'x6': [404.29, 75.21, 24220.34],
'x7': [np.nan, np.nan, np.nan],
'v': [np.nan, np.nan, np.nan],
'y': [np.nan, 75.33, np.nan],
'ay': [np.nan, np.nan, np.nan],
'by': [np.nan, np.nan, np.nan],
'cy': [np.nan, np.nan, np.nan],
'gy': [np.nan, np.nan, np.nan],
'uap': [np.nan, 75.33, np.nan],
'ubp': [np.nan, 75.33, np.nan],
'sf': [np.nan, 2.0, np.nan]}
The second row should not be touched because one of the columns is different.
How can I achieve this?
Answers:
If you want to match all duplicates you can use:
selection = ['x6', 'y', 'uap', 'ubp']
# compare all values to the first one
m = df[selection].eq(df[selection[0]], axis=0)
# if all are duplicates, mask them except the first
df.loc[m.all(axis=1), selection[1:]] = np.nan
Output:
date x1 x2 x3 x4 x5 x6 x7 v y ay by cy gy uap ubp sf
0 2023-02-22 descx1a ALSFNHF950 NaN NaN NaN 404.29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2023-02-21 descx1b KLUGUIF615 NaN NaN NaN 75.21 NaN NaN 75.33 NaN NaN NaN NaN 75.33 75.33 2.0
2 2023-02-23 descx1c NaN 24319.4 24334.15 24040.11 24220.34 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Intermediates:
m
x6 y uap ubp
0 True True True True # all True = duplicate
1 True False False False
2 True False False False
m.all(axis=1)
0 True
1 False
2 False
dtype: bool
Precision
Note that if you have floating point values, seemingly identical values might not compare equal. In this case it might be safer to compute the mask with:
import numpy as np
m = np.isclose(df[selection], df[[selection[0]]])
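For example, the tolerance-based mask plugs into the same assignment as before (a minimal, self-contained sketch on a trimmed-down frame; `rtol`/`atol` are left at NumPy's defaults):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x6': [404.29, 75.21],
                   'y': [404.29 + 1e-9, 75.33],   # equal to x6 only within tolerance
                   'uap': [404.29, 75.33],
                   'ubp': [404.29, 75.33]})
selection = ['x6', 'y', 'uap', 'ubp']

# compare every selected column to the first one, within floating-point tolerance;
# the double brackets keep a (n, 1) shape so NumPy broadcasts across columns
m = np.isclose(df[selection], df[[selection[0]]])

# m is a plain ndarray here, so use ndarray.all; mask all but the first column
df.loc[m.all(axis=1), selection[1:]] = np.nan
```

Exact comparison with `.eq` would miss the first row here, since `404.29 + 1e-9 != 404.29`.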
You can do:
selection = ['x6', 'y', 'uap', 'ubp']
# A value counts as a duplicate if it equals its left or right neighbour,
# i.e. the column-wise diff is 0 in at least one direction; a row qualifies
# only when that holds for every selected column
m = (df[selection].diff(axis='columns').eq(0) |
df[selection].diff(-1, axis='columns').eq(0)).all(1)
# For the rows flagged by the mask, set every selected column except the first to NaN
df.loc[m, selection[1:]] = np.nan
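Run on just the four relevant columns of the sample data (a trimmed-down reproduction, not the full 7-million-row frame), the diff-based mask flags only the first row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x6': [404.29, 75.21, 24220.34],
                   'y': [404.29, 75.33, np.nan],
                   'uap': [404.29, 75.33, np.nan],
                   'ubp': [404.29, 75.33, np.nan]})
selection = ['x6', 'y', 'uap', 'ubp']

# forward and backward column-wise diffs; a diff of 0 in either direction
# means the value duplicates a neighbour, and .all(axis=1) requires this
# for every selected column in the row
m = (df[selection].diff(axis='columns').eq(0) |
     df[selection].diff(-1, axis='columns').eq(0)).all(axis=1)
# m: [True, False, False] -- only row 0 has all four values equal

df.loc[m, selection[1:]] = np.nan
```

Note that `NaN - NaN` is `NaN`, not 0, so all-NaN stretches (like row 2 here) are never treated as duplicates by this approach.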