Mark duplicates as 0 in a new column based on a condition

Question

I have a dataframe as below:

import pandas as pd

data =[['a',96.21623993,1],
['a',99.88211060,1],
['b',99.90232849,1],
['b',99.91232849,1],
['b',99.91928864,1],
['c',99.89162445,1],
['d',99.95264435,1],
['a',99.82862091,2],
['a',99.84466553,2],
['b',99.89685059,2],
['c',78.10614777,2],
['c',97.73305511,2],
['d',95.42383575,2],
]

df = pd.DataFrame(data, columns=['ename','score', 'groupid'])
df

I need to mark duplicates in a new column, but NOT the row with the highest score. Rows should be grouped on groupid and ename.

I am looking to get the output below. Any help is really appreciated.

ename   score   groupid duplicate
    a   96.21624    1   TRUE
    a   99.882111   1   FALSE
    b   99.902328   1   TRUE
    b   99.912328   1   TRUE
    b   99.919289   1   FALSE
    c   99.891624   1   FALSE
    d   99.952644   1   FALSE
    a   99.828621   2   TRUE
    a   99.844666   2   FALSE
    b   99.896851   2   FALSE
    c   78.106148   2   TRUE
    c   97.733055   2   FALSE
    d   95.423836   2   FALSE

Asked By: tt0206


Answer 1

I hope I’ve understood you right:

df['duplicate'] = True
df.loc[df.groupby(['groupid', 'ename'])['score'].idxmax().values, 'duplicate'] = False

print(df)

Prints:

   ename      score  groupid  duplicate
0      a  96.216240        1       True
1      a  99.882111        1      False
2      b  99.902328        1       True
3      b  99.912328        1       True
4      b  99.919289        1      False
5      c  99.891624        1      False
6      d  99.952644        1      False
7      a  99.828621        2       True
8      a  99.844666        2      False
9      b  99.896851        2      False
10     c  78.106148        2       True
11     c  97.733055        2      False
12     d  95.423836        2      False

Answered By: Andrej Kesely
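
One thing worth checking with the idxmax approach is how ties behave. The sketch below uses a small made-up frame (not the data from the question) in which two rows of one group share the top score. idxmax returns the label of the first maximum in each group, so only one row per group ends up False. It also assumes the index labels are unique; if they are not, calling df.reset_index(drop=True) first is probably the safer route.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 tie for the highest score in group (1, 'a').
df = pd.DataFrame(
    [["a", 1.0, 1], ["a", 2.0, 1], ["a", 2.0, 1], ["b", 5.0, 1]],
    columns=["ename", "score", "groupid"],
)

# Index label of the first row holding the maximum score in each group.
keep = df.groupby(["groupid", "ename"])["score"].idxmax()

# Everything starts as a duplicate; the one kept row per group is cleared.
df["duplicate"] = True
df.loc[keep, "duplicate"] = False

print(df)
```

With this data, row 1 is kept and row 2 (the tied row) is still marked as a duplicate.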

Answer 2

This is fairly straightforward: compute the maximum score per group with groupby and transform, then mark every row whose score equals that maximum as FALSE. There may be a shorter way, but this should work.

import pandas as pd

data =[['a',96.21623993,1],
['a',99.88211060,1],
['b',99.90232849,1],
['b',99.91232849,1],
['b',99.91928864,1],
['c',99.89162445,1],
['d',99.95264435,1],
['a',99.82862091,2],
['a',99.84466553,2],
['b',99.89685059,2],
['c',78.10614777,2],
['c',97.73305511,2],
['d',95.42383575,2],
]

df = pd.DataFrame(data, columns=['ename','score', 'groupid'])

df["temp"] = df.groupby(["ename","groupid"])["score"].transform("max")

   ename      score  groupid       temp
0      a  96.216240        1  99.882111
1      a  99.882111        1  99.882111
2      b  99.902328        1  99.919289
3      b  99.912328        1  99.919289
4      b  99.919289        1  99.919289
5      c  99.891624        1  99.891624
6      d  99.952644        1  99.952644
7      a  99.828621        2  99.844666
8      a  99.844666        2  99.844666
9      b  99.896851        2  99.896851
10     c  78.106148        2  97.733055
11     c  97.733055        2  97.733055
12     d  95.423836        2  95.423836

df["duplicate"] = 'TRUE'
df.loc[df.temp==df.score,"duplicate"] = 'FALSE'
df = df.drop(columns="temp")

Out[9]: 
   ename      score  groupid duplicate
0      a  96.216240        1      TRUE
1      a  99.882111        1     FALSE
2      b  99.902328        1      TRUE
3      b  99.912328        1      TRUE
4      b  99.919289        1     FALSE
5      c  99.891624        1     FALSE
6      d  99.952644        1     FALSE
7      a  99.828621        2      TRUE
8      a  99.844666        2     FALSE
9      b  99.896851        2     FALSE
10     c  78.106148        2      TRUE
11     c  97.733055        2     FALSE
12     d  95.423836        2     FALSE

Answered By: Vikram Raghu
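
A shorter variant of the transform idea, sketched here as one possibility, compares each score with its group maximum directly and skips the temporary column. It uses the groupid 2 rows from the question. The 0/1 column name duplicate_flag is an assumption based on the title (0 for duplicates, 1 for the kept row); the question's expected output only shows TRUE/FALSE.

```python
import pandas as pd

# The groupid == 2 rows from the question.
data = [
    ["a", 99.82862091, 2],
    ["a", 99.84466553, 2],
    ["b", 99.89685059, 2],
    ["c", 78.10614777, 2],
    ["c", 97.73305511, 2],
    ["d", 95.42383575, 2],
]
df = pd.DataFrame(data, columns=["ename", "score", "groupid"])

# Maximum score of each (ename, groupid) group, aligned to the original rows.
group_max = df.groupby(["ename", "groupid"])["score"].transform("max")

# A row is a duplicate when its score is below its group's maximum.
df["duplicate"] = df["score"] != group_max

# Hypothetical 0/1 column: 0 for duplicates, 1 for the highest-scoring row.
df["duplicate_flag"] = (~df["duplicate"]).astype(int)

print(df)
```

Unlike idxmax, this comparison keeps every row that ties for the maximum, so a group with tied top scores has more than one row marked False.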
