# Drop non-unique values in a range of columns based on a condition from a different range of columns

## Question:

This is a small part of a df.

In this case, I have 3 y-values I need to map: `0.933883`, `97.658330` and `1.650013`

I have this `df`

``````
      x  y1  y2         y3         y4          d1  d2         d3         d4
23  5.3 NaN NaN   0.933883        NaN         NaN NaN   0.174866        NaN
25  5.3 NaN NaN        NaN  97.658330         NaN NaN        NaN   0.038670
26  5.3 NaN NaN   1.650013        NaN         NaN NaN   0.541264        NaN
29  5.3 NaN NaN  97.658330        NaN         NaN NaN  96.549581        NaN
30  5.3 NaN NaN        NaN   1.650013         NaN NaN        NaN  96.046987
``````
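To make the example reproducible, the frame above can be rebuilt roughly as follows. This is only a sketch: the values and index labels are copied from the printed table, and the `nan` alias is just a local shorthand.

```python
import numpy as np
import pandas as pd

nan = np.nan  # local shorthand, not part of the original post

# Values and index labels copied from the printed sample
df = pd.DataFrame(
    {
        "x": [5.3] * 5,
        "y1": [nan] * 5,
        "y2": [nan] * 5,
        "y3": [0.933883, nan, 1.650013, 97.658330, nan],
        "y4": [nan, 97.658330, nan, nan, 1.650013],
        "d1": [nan] * 5,
        "d2": [nan] * 5,
        "d3": [0.174866, nan, 0.541264, 96.549581, nan],
        "d4": [nan, 0.038670, nan, nan, 96.046987],
    },
    index=[23, 25, 26, 29, 30],
)
print(df)
```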

Each of these values appears at most once per column; I have already dropped duplicates.

What I need:

The same value must not appear in more than one column.

The condition to choose which row to remove is as shown in this example:

`97.658330` appears in both `y3` and `y4`. For that value, `d3` (96.549581) is bigger than `d4` (0.038670), so row `29` is removed.

`1.650013` appears in both `y3` and `y4`. Here `d4` (96.046987) is bigger than `d3` (0.541264), so row `30` is removed.

Output:

``````
      x  y1  y2         y3         y4          d1  d2         d3         d4
23  5.3 NaN NaN   0.933883        NaN         NaN NaN   0.174866        NaN
25  5.3 NaN NaN        NaN  97.658330         NaN NaN        NaN   0.038670
26  5.3 NaN NaN   1.650013        NaN         NaN NaN   0.541264        NaN
``````

P.S. There are a lot more values to map inside the complete data frame.

There may be a more efficient solution, but this works. First, collect the values common to columns `y3` and `y4` in a list. Then, for each common value, look up the corresponding `d3` and `d4` values (`v1` and `v2`). Finally, drop the row by its index label according to the condition: the row with the larger `d` value is removed.

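``````
# Collect the non-NaN values of y3 and y4, sorted so that duplicates sit next to each other
# (explicit dropna() so the result does not depend on whether stack() drops NaN)
vals = sorted(df[['y3', 'y4']].stack().dropna())
# A value that occurs twice lands in both the even and the odd slice
dupes = list(set(vals[::2]) & set(vals[1::2]))  # https://stackoverflow.com/a/64956890/15415267
# dupes contains 1.650013 and 97.65833

for i in dupes:
    v1 = df[df['y3'] == i]['d3'].iloc[0]  # d3 on the row where y3 holds the value
    v2 = df[df['y4'] == i]['d4'].iloc[0]  # d4 on the row where y4 holds the value
    if v1 > v2:
        df = df.drop(df[df['y3'] == i].index)
    else:
        df = df.drop(df[df['y4'] == i].index)
print(df)
'''
      x  y1  y2        y3        y4  d1  d2        d3       d4
23  5.3 NaN NaN  0.933883       NaN NaN NaN  0.174866      NaN
25  5.3 NaN NaN       NaN  97.65833 NaN NaN       NaN  0.03867
26  5.3 NaN NaN  1.650013       NaN NaN NaN  0.541264      NaN
'''
``````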

You can use:

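``````
y = df.filter(regex=r'y\d+')
d = df.filter(regex=r'd\d+')

# target = [0.933883, 97.658330, 1.650013]

# define the target values automatically
# (explicit dropna() so the result does not depend on whether stack() drops NaN)
s = y.stack().dropna()
target = set(s[s.duplicated()])
# {1.650013, 97.65833}

drop = set()
for x in target:
    # d values at the positions where y equals x, indexed by row label
    s = d.where(y.eq(x).to_numpy()).stack().dropna().droplevel(1)
    # keep the row with the smallest d, mark the others for removal
    drop.update(s.index.difference([s.idxmin()]))

# drop is {29, 30}

out = df.drop(list(drop))
``````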

Output:

``````
      x  y1  y2        y3        y4  d1  d2        d3       d4
23  5.3 NaN NaN  0.933883       NaN NaN NaN  0.174866      NaN
25  5.3 NaN NaN       NaN  97.65833 NaN NaN       NaN  0.03867
26  5.3 NaN NaN  1.650013       NaN NaN NaN  0.541264      NaN
``````
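
For a larger frame, the same rule (keep the row with the smallest `d` for each repeated value) could also be written without an explicit Python loop. The following is only a sketch under some assumptions: every `yN` column is paired positionally with a `dN` column, each non-NaN `y` cell has a non-NaN `d` next to it, and the names `cells`, `winners` and `losers` are made up for illustration. The sample frame is rebuilt with only the columns that carry data.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame from the question, reduced to the columns that carry data
df = pd.DataFrame(
    {
        "x": [5.3] * 5,
        "y3": [0.933883, nan, 1.650013, 97.658330, nan],
        "y4": [nan, 97.658330, nan, nan, 1.650013],
        "d3": [0.174866, nan, 0.541264, 96.549581, nan],
        "d4": [nan, 0.038670, nan, nan, 96.046987],
    },
    index=[23, 25, 26, 29, 30],
)

y = df.filter(regex=r"^y\d+$")
d = df.filter(regex=r"^d\d+$")

# One record per non-NaN y cell: row label, y value, and the d value
# from the positionally paired column (row-major order in all three arrays)
cells = pd.DataFrame(
    {
        "row": np.repeat(df.index.to_numpy(), y.shape[1]),
        "y": y.to_numpy().ravel(),
        "d": d.to_numpy().ravel(),
    }
).dropna(subset=["y"])

# For each distinct y value, the record with the smallest d wins
winners = cells.groupby("y")["d"].idxmin()
# Every other record marks a row to remove
losers = cells.drop(index=winners.to_numpy())
out = df.drop(index=losers["row"].unique())
print(out)
```

On the sample this removes rows `29` and `30`, the same rows the loop-based answers drop. Note that a row holding several values is dropped as soon as one of them loses, which may or may not be what is wanted on the full data.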