Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if every column has the same duplicate

Question:

This is another extension of my previous questions, "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise" and "Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise only if there are 3 or more duplicates".

I have the following dataframe (the real one has around 7 million rows):

import numpy as np
import pandas as pd

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [404.29, 75.21, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, 75.33, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [404.29, 75.33, np.nan],
        'ubp': [404.29, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

df = pd.DataFrame(data)

If all the values in my selection of columns are duplicates of each other, I want to delete the duplicates and keep only one copy. This should happen if and only if every item in the selection is a duplicate.

Meaning if my selection has 4 columns, all 4 columns must have the same number for it to be considered a duplicate.

If only 2 or 3 of the 4 selected columns hold the same value, it does not count.

So in my example above, if my selection is ['x6', 'y', 'uap', 'ubp'], the output should be:

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [404.29, 75.21, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [np.nan, 75.33, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [np.nan, 75.33, np.nan],
        'ubp': [np.nan, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

The second row should not be touched because one of the columns is different.

How can I achieve this?

Asked By: anarchy


Answers:

If you want to match all duplicates you can use:

selection = ['x6', 'y', 'uap', 'ubp']

# compare all values to the first one
m = df[selection].eq(df[selection[0]], axis=0)

# if all are duplicates, mask them except the first
df.loc[m.all(axis=1), selection[1:]] = np.nan

Output:

         date       x1          x2       x3        x4        x5        x6  x7   v      y  ay  by  cy  gy    uap    ubp   sf
0  2023-02-22  descx1a  ALSFNHF950      NaN       NaN       NaN    404.29 NaN NaN    NaN NaN NaN NaN NaN    NaN    NaN  NaN
1  2023-02-21  descx1b  KLUGUIF615      NaN       NaN       NaN     75.21 NaN NaN  75.33 NaN NaN NaN NaN  75.33  75.33  2.0
2  2023-02-23  descx1c         NaN  24319.4  24334.15  24040.11  24220.34 NaN NaN    NaN NaN NaN NaN NaN    NaN    NaN  NaN

Intermediates:

m
     x6      y    uap    ubp
0  True   True   True   True  # all True = duplicate
1  True  False  False  False
2  True  False  False  False

m.all(axis=1)
0     True
1    False
2    False
dtype: bool
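The steps above can be checked end to end against the output requested in the question. This is only a minimal sketch: it uses a subset of the question's columns (enough to exercise the selection), and the `expected` frame is written out by hand from the asker's desired result.

```python
import numpy as np
import pandas as pd

# Subset of the question's columns, enough to exercise the selection
df = pd.DataFrame({
    'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
    'x6': [404.29, 75.21, 24220.34],
    'y': [404.29, 75.33, np.nan],
    'uap': [404.29, 75.33, np.nan],
    'ubp': [404.29, 75.33, np.nan],
})

# Desired result from the question: only row 0 loses its repeated values
expected = pd.DataFrame({
    'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
    'x6': [404.29, 75.21, 24220.34],
    'y': [np.nan, 75.33, np.nan],
    'uap': [np.nan, 75.33, np.nan],
    'ubp': [np.nan, 75.33, np.nan],
})

selection = ['x6', 'y', 'uap', 'ubp']

# compare every selected column to the first selected column, row by row
m = df[selection].eq(df[selection[0]], axis=0)

# rows where all selected columns match keep only the first copy
df.loc[m.all(axis=1), selection[1:]] = np.nan

# raises AssertionError if the frames differ (NaNs in the same place count as equal)
pd.testing.assert_frame_equal(df, expected)
```

Row 2 is left alone because NaN never compares equal to anything, so its mask is not all True.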

Precision:

Note that with floating-point values, seemingly identical values might not compare equal. In that case it might be safer to compute the mask with np.isclose, which compares within a tolerance:

import numpy as np
m = np.isclose(df[selection], df[[selection[0]]])
Answered By: mozway
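To make the precision note in the answer above concrete, here is a hedged sketch that wraps the `np.isclose` variant in a helper. The function name `mask_full_duplicates` and the toy values are made up for illustration and are not from the question; `rtol` and `atol` are simply passed through to `np.isclose` with its documented defaults. It assumes the selected columns are numeric.

```python
import numpy as np
import pandas as pd

def mask_full_duplicates(df, selection, rtol=1e-05, atol=1e-08):
    """Return a copy of df where rows whose selected columns are all
    (approximately) equal keep only the first selected column.

    Hypothetical helper, not part of pandas."""
    values = df[selection].to_numpy(dtype=float)
    # compare every selected column to the first one, within a tolerance
    close = np.isclose(values, values[:, [0]], rtol=rtol, atol=atol)
    all_dup = close.all(axis=1)
    out = df.copy()
    out.loc[all_dup, selection[1:]] = np.nan
    return out

selection = ['x6', 'y', 'uap', 'ubp']

# Illustrative values: 0.1 + 0.2 is not exactly 0.3 in binary floating point
df = pd.DataFrame({
    'x6': [0.1 + 0.2, 75.21],
    'y': [0.3, 75.33],
    'uap': [0.3, 75.33],
    'ubp': [0.3, 75.33],
})

# exact comparison misses row 0, the tolerant comparison catches it
exact = df[selection].eq(df[selection[0]], axis=0).all(axis=1)
out = mask_full_duplicates(df, selection)
```

With the default tolerances, values very close to zero may need a smaller `atol`; whether that matters depends on the data.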

You can do:

selection = ['x6', 'y', 'uap', 'ubp']

# Check whether each value equals at least one of its neighbouring columns:
# a diff of 0 to the left or to the right neighbour means the two are equal.
# Requiring this for every column flags rows whose values are all the same.
# Note that only adjacent columns are compared, so a row such as [1, 1, 2, 2]
# is flagged as well.
m = (df[selection].diff(axis='columns').eq(0) |
     df[selection].diff(-1, axis='columns').eq(0)).all(axis=1)

# Then select such rows you found by above mask and the columns other than the first one - assign them np.nan
df.loc[m, selection[1:]] = np.nan
Answered By: SomeDude
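One caveat with the neighbour-difference mask is worth checking before using it on real data: it only compares adjacent columns, so a row made of two distinct pairs also passes. The sketch below uses a hypothetical one-row frame (not from the question) and contrasts it with a comparison against the first selected column.

```python
import numpy as np
import pandas as pd

selection = ['x6', 'y', 'uap', 'ubp']

# Hypothetical row: two distinct pairs, so NOT all four values are equal
pairs = pd.DataFrame({'x6': [1.0], 'y': [1.0], 'uap': [2.0], 'ubp': [2.0]})

# neighbour-difference mask: every value equals its left or right neighbour
neighbour_mask = (pairs[selection].diff(axis='columns').eq(0)
                  | pairs[selection].diff(-1, axis='columns').eq(0)).all(axis=1)

# comparison against the first selected column
first_col_mask = pairs[selection].eq(pairs[selection[0]], axis=0).all(axis=1)

print(neighbour_mask.iloc[0])  # the row is flagged as a full duplicate
print(first_col_mask.iloc[0])  # the row is not flagged
```

If rows like this can occur in the data, comparing against the first column is the safer of the two masks.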