Pandas Groupby and Compare rows to find maximum value
Question:
I’ve a dataframe
a
b
c
one
6
11
one
7
12
two
8
23
two
9
14
three
10
15
three
20
25
I want to apply groupby at column a
and then find the highest value in column c
, so that, the highest value gets flagged, i.e.
a
b
c
one
6
11
one
7
12
Compare value 11&12, then
a
b
c
two
8
23
two
9
14
Compare value 23&14, then
a
b
c
three
10
15
three
20
25
Finally resulting in:
a
b
c
flag
one
6
11
no
one
7
12
yes
two
8
23
yes
two
9
14
no
three
10
15
no
three
20
25
yes
I/P DF:
df = pd.DataFrame({
'a':["one","one","two","two","three","three"]
, 'b':[6,7,8,9,10,20]
, 'c':[11,12,23,14,15,25]
# , 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']
})
df
Answers:
You can use groupby.transform
to get the max value per group, and numpy.where
to map the True
/False
to 'yes'
/'no'
:
df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')
output:
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
Intermediates:
df.groupby('a')['c'].transform('max')
0 12
1 12
2 23
3 23
4 25
5 25
Name: c, dtype: int64
df.groupby('a')['c'].transform('max').eq(df['c'])
0 False
1 True
2 True
3 False
4 False
5 True
Name: c, dtype: bool
Use GroupBy.transform
with max
, comapre to same column c
and then set yes/no
in numpy.where
:
df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
print(df)
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
If multiple values per a
with maximal values get multiple yes
, if need only first maximal values use DataFrameGroupBy.idxmax
and compare df.index
:
df = pd.DataFrame({
'a':["one","one","one","two","three","three"]
, 'b':[6,7,8,9,10,20]
, 'c':[11,12,12,14,15,25]
})
df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')
print(df)
a b c flag1 flag2
0 one 6 11 no no
1 one 7 12 yes yes
2 one 8 12 yes no <- difference for match all max or first max
3 two 9 14 yes yes
4 three 10 15 no no
5 three 20 25 yes yes
One way to do that is as follows
df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df.groupby('a')['c'].max().values and x['a'] == df.groupby('c')['a'].max().loc[x['c']] else 'no', axis=1)
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
Breaking down the various steps that one is doing above
-
df['flag']
creates the new column named flag
.
-
df.groupby('a')['c'].max()
will group by column a
, with pandas.DataFrame.groupby
, and find the highest value in column c
.
df2 = df.groupby('a')['c'].max()
-
Then we check if the value is in the dataframe generated in step 2 AND if the group is the same.
df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df2.values and x['a'] == df2.loc[x['c']] else 'no', axis=1)
Notes:
-
Checking if the group is the same is key, else, even though it was working for this specific case, it wouldn’t work if a group had a non-max value that was the max value of another group (as mozway mentioned).
-
As indicated in the answer that jezrael shared, .apply
can be slow and, even though does the work, it might not be the most convenient way to do that.
I’ve a dataframe
a | b | c |
---|---|---|
one | 6 | 11 |
one | 7 | 12 |
two | 8 | 23 |
two | 9 | 14 |
three | 10 | 15 |
three | 20 | 25 |
I want to apply groupby at column a
and then find the highest value in column c
, so that, the highest value gets flagged, i.e.
a | b | c |
---|---|---|
one | 6 | 11 |
one | 7 | 12 |
Compare value 11&12, then
a | b | c |
---|---|---|
two | 8 | 23 |
two | 9 | 14 |
Compare value 23&14, then
a | b | c |
---|---|---|
three | 10 | 15 |
three | 20 | 25 |
Finally resulting in:
a | b | c | flag |
---|---|---|---|
one | 6 | 11 | no |
one | 7 | 12 | yes |
two | 8 | 23 | yes |
two | 9 | 14 | no |
three | 10 | 15 | no |
three | 20 | 25 | yes |
I/P DF:
df = pd.DataFrame({
'a':["one","one","two","two","three","three"]
, 'b':[6,7,8,9,10,20]
, 'c':[11,12,23,14,15,25]
# , 'flag': ['no', 'yes', 'yes', 'no', 'no', 'yes']
})
df
You can use groupby.transform
to get the max value per group, and numpy.where
to map the True
/False
to 'yes'
/'no'
:
df['flag'] = np.where(df.groupby('a')['c'].transform('max').eq(df['c']), 'yes', 'no')
output:
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
Intermediates:
df.groupby('a')['c'].transform('max')
0 12
1 12
2 23
3 23
4 25
5 25
Name: c, dtype: int64
df.groupby('a')['c'].transform('max').eq(df['c'])
0 False
1 True
2 True
3 False
4 False
5 True
Name: c, dtype: bool
Use GroupBy.transform
with max
, comapre to same column c
and then set yes/no
in numpy.where
:
df['flag'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
print(df)
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
If multiple values per a
with maximal values get multiple yes
, if need only first maximal values use DataFrameGroupBy.idxmax
and compare df.index
:
df = pd.DataFrame({
'a':["one","one","one","two","three","three"]
, 'b':[6,7,8,9,10,20]
, 'c':[11,12,12,14,15,25]
})
df['flag1'] = np.where(df.c.eq(df.groupby('a')['c'].transform('max')), 'yes', 'no')
df['flag2'] = np.where(df.index == df.groupby('a')['c'].transform('idxmax'), 'yes', 'no')
print(df)
a b c flag1 flag2
0 one 6 11 no no
1 one 7 12 yes yes
2 one 8 12 yes no <- difference for match all max or first max
3 two 9 14 yes yes
4 three 10 15 no no
5 three 20 25 yes yes
One way to do that is as follows
df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df.groupby('a')['c'].max().values and x['a'] == df.groupby('c')['a'].max().loc[x['c']] else 'no', axis=1)
a b c flag
0 one 6 11 no
1 one 7 12 yes
2 two 8 23 yes
3 two 9 14 no
4 three 10 15 no
5 three 20 25 yes
Breaking down the various steps that one is doing above
-
df['flag']
creates the new column namedflag
. -
df.groupby('a')['c'].max()
will group by columna
, withpandas.DataFrame.groupby
, and find the highest value in columnc
.df2 = df.groupby('a')['c'].max()
-
Then we check if the value is in the dataframe generated in step 2 AND if the group is the same.
df['flag'] = df.apply(lambda x: 'yes' if x['c'] in df2.values and x['a'] == df2.loc[x['c']] else 'no', axis=1)
Notes:
-
Checking if the group is the same is key, else, even though it was working for this specific case, it wouldn’t work if a group had a non-max value that was the max value of another group (as mozway mentioned).
-
As indicated in the answer that jezrael shared,
.apply
can be slow and, even though does the work, it might not be the most convenient way to do that.