Drop non-unique values in a range of columns based on a condition from a different range of columns

Question:

This is a small part of a df.

In this case, I have 3 y-values I need to map: 0.933883, 97.658330 and 1.650013

I have this df

      x  y1  y2         y3         y4          d1  d2         d3         d4
23  5.3 NaN NaN   0.933883        NaN         NaN NaN   0.174866        NaN
25  5.3 NaN NaN        NaN  97.658330         NaN NaN        NaN   0.038670
26  5.3 NaN NaN   1.650013        NaN         NaN NaN   0.541264        NaN
29  5.3 NaN NaN  97.658330        NaN         NaN NaN  96.549581        NaN
30  5.3 NaN NaN        NaN   1.650013         NaN NaN        NaN  96.046987

There is not more than one of these values per column, I already dropped duplicates.

What I need:

I can not have the same value in more than one column.

The condition to choose which row to remove is as shown in this example:

There is 97.658330 in column y3 and y4. Since, for that value, d3(96.549581) is bigger than d4(0.038670), row 29 is removed.

There is 1.650013 in column y3 and y4. Since d4(96.046987) is bigger than d3(0.541264), row 30 is removed.

Output:

      x  y1  y2         y3         y4          d1  d2         d3         d4
23  5.3 NaN NaN   0.933883        NaN         NaN NaN   0.174866        NaN
25  5.3 NaN NaN        NaN  97.658330         NaN NaN        NaN   0.038670
26  5.3 NaN NaN   1.650013        NaN         NaN NaN   0.541264        NaN

P.S. There are a lot more values to map inside the complete data frame.

Asked By: Peter M

||

Answers:

There may be a more effective solution, but this works. First, let’s take the common values ​​in columns y3 and y4 as a list. Then find what are the values ​​of d3 and d4 while y3 and y4 take the common values ? (v1,v2)
. Finally Drop row by index number according to specified condition.

vals=sorted(list(df[['y3','y4']].stack()))
dupes = list(set(vals[::2]) & set(vals[1::2])) #https://stackoverflow.com/a/64956890/15415267
#dupes= [1.650013, 97.65833]

for i in dupes:
    v1=df[df['y3']==i]['d3'].iloc[0]
    v2=df[df['y4']==i]['d4'].iloc[0]
    if v1 > v2:
        df=df.drop(df[df['y3']==i]['d3'].index)
    else:
        df=df.drop(df[df['y4']==i]['d4'].index)
print(df)
'''
      x  y1  y2        y3        y4  d1  d2        d3       d4
23  5.3 NaN NaN  0.933883       NaN NaN NaN  0.174866      NaN
25  5.3 NaN NaN       NaN  97.65833 NaN NaN       NaN  0.03867
26  5.3 NaN NaN  1.650013       NaN NaN NaN  0.541264      NaN
'''
Answered By: Clegane

You can use:

y = df.filter(regex=r'yd+')
d = df.filter(regex=r'dd+')

# target = [0.933883, 97.658330, 1.650013]

# define the target values automatically
s = y.stack()
target = set(s[s.duplicated()])
# {1.650013, 97.65833}

drop = set()
for x in target:
    s = d.where(y.eq(x).to_numpy()).stack().droplevel(1)
    drop.update(s.index.difference([s.idxmin()]))

# drop is {29, 30}

out = df.drop(drop)

Output:

      x  y1  y2        y3        y4  d1  d2        d3       d4
23  5.3 NaN NaN  0.933883       NaN NaN NaN  0.174866      NaN
25  5.3 NaN NaN       NaN  97.65833 NaN NaN       NaN  0.03867
26  5.3 NaN NaN  1.650013       NaN NaN NaN  0.541264      NaN
Answered By: mozway
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.