Mark duplicates in a new column based on a condition
Question:
I have a dataframe as below:
import pandas as pd
data =[['a',96.21623993,1],
['a',99.88211060,1],
['b',99.90232849,1],
['b',99.91232849,1],
['b',99.91928864,1],
['c',99.89162445,1],
['d',99.95264435,1],
['a',99.82862091,2],
['a',99.84466553,2],
['b',99.89685059,2],
['c',78.10614777,2],
['c',97.73305511,2],
['d',95.42383575,2],
]
df = pd.DataFrame(data, columns=['ename','score', 'groupid'])
df
I need to mark duplicates in a new column, but NOT the row with the highest score. The grouping should be on groupid and ename.
I am looking to get the output below. Any help is really appreciated.
ename score groupid duplicate
a 96.21624 1 TRUE
a 99.882111 1 FALSE
b 99.902328 1 TRUE
b 99.912328 1 TRUE
b 99.919289 1 FALSE
c 99.891624 1 FALSE
d 99.952644 1 FALSE
a 99.828621 2 TRUE
a 99.844666 2 FALSE
b 99.896851 2 FALSE
c 78.106148 2 TRUE
c 97.733055 2 FALSE
d 95.423836 2 FALSE
Answers:
I hope I’ve understood you right:
df['duplicate'] = True
df.loc[df.groupby(['groupid', 'ename'])['score'].idxmax().values, 'duplicate'] = False
print(df)
Prints:
ename score groupid duplicate
0 a 96.216240 1 True
1 a 99.882111 1 False
2 b 99.902328 1 True
3 b 99.912328 1 True
4 b 99.919289 1 False
5 c 99.891624 1 False
6 d 99.952644 1 False
7 a 99.828621 2 True
8 a 99.844666 2 False
9 b 99.896851 2 False
10 c 78.106148 2 True
11 c 97.733055 2 False
12 d 95.423836 2 False
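One detail of the idxmax approach is worth checking on your own data: idxmax returns the index label of the first maximum in each group, so when two rows tie for the highest score only the first one stays False. Below is a minimal sketch of that behaviour. The small frame is made up for illustration (it is not the question's data), and it assumes the index labels are unique.

```python
import pandas as pd

# Illustrative data (not from the question): group ('a', 1) has a tie for the top score.
df = pd.DataFrame(
    {
        "ename": ["a", "a", "a", "b"],
        "score": [90.0, 95.0, 95.0, 80.0],
        "groupid": [1, 1, 1, 1],
    }
)

# Index label of the first row holding the maximum score in each (groupid, ename) group.
keep = df.groupby(["groupid", "ename"])["score"].idxmax()

# Every row is flagged as a duplicate unless it is the row kept for its group.
df["duplicate"] = ~df.index.isin(keep)
print(df)
```

Here rows 1 and 2 share the top score in group ('a', 1), but only row 1 is kept, so row 2 is flagged as a duplicate.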
This is fairly easy. We compute the max score of each group with groupby, then mark every row whose score is not that max. There may be a simpler method as well, but this should work.
import pandas as pd
data =[['a',96.21623993,1],
['a',99.88211060,1],
['b',99.90232849,1],
['b',99.91232849,1],
['b',99.91928864,1],
['c',99.89162445,1],
['d',99.95264435,1],
['a',99.82862091,2],
['a',99.84466553,2],
['b',99.89685059,2],
['c',78.10614777,2],
['c',97.73305511,2],
['d',95.42383575,2],
]
df = pd.DataFrame(data, columns=['ename','score', 'groupid'])
df["temp"] = df.groupby(["ename","groupid"])["score"].transform("max")
ename score groupid temp
0 a 96.216240 1 99.882111
1 a 99.882111 1 99.882111
2 b 99.902328 1 99.919289
3 b 99.912328 1 99.919289
4 b 99.919289 1 99.919289
5 c 99.891624 1 99.891624
6 d 99.952644 1 99.952644
7 a 99.828621 2 99.844666
8 a 99.844666 2 99.844666
9 b 99.896851 2 99.896851
10 c 78.106148 2 97.733055
11 c 97.733055 2 97.733055
12 d 95.423836 2 95.423836
df["duplicate"] = 'TRUE'
df.loc[df.temp==df.score,"duplicate"] = 'FALSE'
df = df.drop(columns="temp")
Out[9]:
ename score groupid duplicate
0 a 96.216240 1 TRUE
1 a 99.882111 1 FALSE
2 b 99.902328 1 TRUE
3 b 99.912328 1 TRUE
4 b 99.919289 1 FALSE
5 c 99.891624 1 FALSE
6 d 99.952644 1 FALSE
7 a 99.828621 2 TRUE
8 a 99.844666 2 FALSE
9 b 99.896851 2 FALSE
10 c 78.106148 2 TRUE
11 c 97.733055 2 FALSE
12 d 95.423836 2 FALSE
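The same transform("max") idea can probably be written without the temporary column, by comparing each score to its group maximum directly. This is a sketch under the assumption that a boolean column is acceptable in place of the 'TRUE'/'FALSE' strings; the name group_max is only illustrative.

```python
import pandas as pd

data = [
    ["a", 96.21623993, 1],
    ["a", 99.88211060, 1],
    ["b", 99.90232849, 1],
    ["b", 99.91232849, 1],
    ["b", 99.91928864, 1],
    ["c", 99.89162445, 1],
    ["d", 99.95264435, 1],
    ["a", 99.82862091, 2],
    ["a", 99.84466553, 2],
    ["b", 99.89685059, 2],
    ["c", 78.10614777, 2],
    ["c", 97.73305511, 2],
    ["d", 95.42383575, 2],
]
df = pd.DataFrame(data, columns=["ename", "score", "groupid"])

# Maximum score of each (groupid, ename) group, broadcast back to every row.
group_max = df.groupby(["groupid", "ename"])["score"].transform("max")

# A row is a duplicate when its score differs from its group's maximum.
df["duplicate"] = df["score"].ne(group_max)
print(df)
```

With this comparison, rows that tie for the group maximum would all come out False. If a 0/1 column is preferred over booleans, calling .astype(int) on the result converts it.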