Drop non-unique values in a range of columns based on a condition from a different range of columns
Question:
This is a small part of a df.
In this case, I have 3 y-values I need to map: 0.933883, 97.658330 and 1.650013.
I have this df
      x   y1   y2         y3         y4   d1   d2         d3         d4
23  5.3  NaN  NaN   0.933883        NaN  NaN  NaN   0.174866        NaN
25  5.3  NaN  NaN        NaN  97.658330  NaN  NaN        NaN   0.038670
26  5.3  NaN  NaN   1.650013        NaN  NaN  NaN   0.541264        NaN
29  5.3  NaN  NaN  97.658330        NaN  NaN  NaN  96.549581        NaN
30  5.3  NaN  NaN        NaN   1.650013  NaN  NaN        NaN  96.046987
Each of these values appears at most once per column; I already dropped duplicates.
What I need:
I cannot have the same value in more than one column.
The condition to choose which row to remove is as shown in this example:
There is 97.658330 in columns y3 and y4. Since, for that value, d3 (96.549581) is bigger than d4 (0.038670), row 29 is removed.
There is 1.650013 in columns y3 and y4. Since d4 (96.046987) is bigger than d3 (0.541264), row 30 is removed.
Output:
      x   y1   y2         y3         y4   d1   d2         d3         d4
23  5.3  NaN  NaN   0.933883        NaN  NaN  NaN   0.174866        NaN
25  5.3  NaN  NaN        NaN  97.658330  NaN  NaN        NaN   0.038670
26  5.3  NaN  NaN   1.650013        NaN  NaN  NaN   0.541264        NaN
P.S. There are a lot more values to map inside the complete data frame.
Answers:
There may be a more efficient solution, but this works. First, collect the values common to columns y3 and y4 into a list. Then, for each common value, look up d3 in the row where y3 equals it and d4 in the row where y4 equals it (v1, v2). Finally, drop the row by index according to the specified condition.
vals = sorted(list(df[['y3', 'y4']].stack()))
dupes = list(set(vals[::2]) & set(vals[1::2]))  # https://stackoverflow.com/a/64956890/15415267
# dupes = [1.650013, 97.65833]
for i in dupes:
    v1 = df[df['y3'] == i]['d3'].iloc[0]  # d3 where y3 holds the shared value
    v2 = df[df['y4'] == i]['d4'].iloc[0]  # d4 where y4 holds the shared value
    if v1 > v2:
        # d3 is larger, so drop the row that has the value in y3
        df = df.drop(df[df['y3'] == i]['d3'].index)
    else:
        # otherwise drop the row that has the value in y4
        df = df.drop(df[df['y4'] == i]['d4'].index)
print(df)
'''
      x   y1   y2        y3        y4   d1   d2        d3       d4
23  5.3  NaN  NaN  0.933883       NaN  NaN  NaN  0.174866      NaN
25  5.3  NaN  NaN       NaN  97.65833  NaN  NaN       NaN  0.03867
26  5.3  NaN  NaN  1.650013       NaN  NaN  NaN  0.541264      NaN
'''
You can use:
y = df.filter(regex=r'y\d+')
d = df.filter(regex=r'd\d+')
# target = [0.933883, 97.658330, 1.650013]
# define the target values automatically
s = y.stack()
target = set(s[s.duplicated()])
# {1.650013, 97.65833}
drop = set()
for x in target:
    # keep only the d values located where y equals x, indexed by row label
    s = d.where(y.eq(x).to_numpy()).stack().droplevel(1)
    # every row except the one with the smallest d gets dropped
    drop.update(s.index.difference([s.idxmin()]))
# drop is {29, 30}
out = df.drop(drop)
Output:
      x   y1   y2        y3        y4   d1   d2        d3       d4
23  5.3  NaN  NaN  0.933883       NaN  NaN  NaN  0.174866      NaN
25  5.3  NaN  NaN       NaN  97.65833  NaN  NaN       NaN  0.03867
26  5.3  NaN  NaN  1.650013       NaN  NaN  NaN  0.541264      NaN
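If the full frame has many shared values, the per-value loop could probably be replaced by a single groupby. The following is only a sketch, not part of either answer above. It assumes the y columns and d columns come in matching order (y1 with d1, y2 with d2, and so on), that every non-NaN y cell has a non-NaN d cell at the same position, and that values are compared by exact float equality, as in the answers. The names long, winners and losers are made up for illustration, and the sample frame is rebuilt by hand from the question. On a tie, the first occurrence is kept.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample frame rebuilt from the question.
df = pd.DataFrame(
    {
        "x": [5.3] * 5,
        "y1": [nan] * 5,
        "y2": [nan] * 5,
        "y3": [0.933883, nan, 1.650013, 97.658330, nan],
        "y4": [nan, 97.658330, nan, nan, 1.650013],
        "d1": [nan] * 5,
        "d2": [nan] * 5,
        "d3": [0.174866, nan, 0.541264, 96.549581, nan],
        "d4": [nan, 0.038670, nan, nan, 96.046987],
    },
    index=[23, 25, 26, 29, 30],
)

y = df.filter(regex=r"^y\d+$")
d = df.filter(regex=r"^d\d+$")

# Positions of every non-NaN y cell, in row-major order.
rows, cols = np.nonzero(y.notna().to_numpy())

# One record per non-NaN y cell, paired with the d cell at the same position
# (assumes y and d columns are in matching order).
long = pd.DataFrame(
    {
        "row": df.index.to_numpy()[rows],
        "y": y.to_numpy()[rows, cols],
        "d": d.to_numpy()[rows, cols],
    }
)

# For each distinct y value, the record with the smallest d is kept.
winners = long.groupby("y")["d"].idxmin()
# Every other record marks a row label to drop.
losers = long.loc[long.index.difference(winners), "row"]

out = df.drop(index=losers.unique())
print(out)
```

On the sample data this should drop rows 29 and 30, the same as the two answers above. Values that appear only once form a group of size one, so they always win and are never dropped.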