Extract row with maximum value in a group pandas dataframe
Question:
A similar question is asked here:
Python : Getting the Row which has the max value in groups using groupby
However, I just need one record per group even if there are more than one record with maximum value in that group.
In the example below, I need one record for “s2”. For me it doesn’t matter which one.
>>> from pandas import DataFrame
>>> df = DataFrame({'Sp':['a','b','c','d','e','f'], 'Mt':['s1', 's1', 's2','s2','s2','s3'], 'Value':[1,2,3,4,5,6], 'count':[3,2,5,10,10,6]})
>>> df
Mt Sp Value count
0 s1 a 1 3
1 s1 b 2 2
2 s2 c 3 5
3 s2 d 4 10
4 s2 e 5 10
5 s3 f 6 6
>>> idx = df.groupby(['Mt'])['count'].transform(max) == df['count']
>>> df[idx]
Mt Sp Value count
0 s1 a 1 3
3 s2 d 4 10
4 s2 e 5 10
5 s3 f 6 6
>>>
Answers:
You can use first
In [14]: df.groupby('Mt').first()
Out[14]:
Sp Value count
Mt
s1 a 1 3
s2 c 3 5
s3 f 6 6
Update
Set as_index=False to achieve your goal:
In [28]: df.groupby('Mt', as_index=False).first()
Out[28]:
Mt Sp Value count
0 s1 a 1 3
1 s2 c 3 5
2 s3 f 6 6
Update Again
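Sorry, I misunderstood the question. If you want the row with the max count in each group, sort by count first. (The old df.sort method has since been removed from pandas; sort_values is its replacement.) When two rows tie for the max, as d and e do in s2, which one ends up first depends on the sort, so you may see d instead of e.
In [196]: df.sort_values('count', ascending=False).groupby('Mt', as_index=False).first()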
Out[196]:
Mt Sp Value count
0 s1 a 1 3
1 s2 e 5 10
2 s3 f 6 6
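For anyone who wants to run this end to end, here is a minimal self-contained sketch of the sort-then-first approach. The choice of kind='mergesort' is my own addition (it is not in the answer above); it requests a stable sort so that the tie in s2 resolves the same way on every run. Also note that GroupBy.first() returns the first non-null value per column, which is the same as the first row only when there are no missing values, as is the case here.

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Value': [1, 2, 3, 4, 5, 6],
    'count': [3, 2, 5, 10, 10, 6],
})

# Put the largest counts first, then keep the first row seen in each group.
# kind='mergesort' is an assumption added here to make the tie order repeatable.
top = (
    df.sort_values('count', ascending=False, kind='mergesort')
      .groupby('Mt', as_index=False)
      .first()
)
print(top)
```

Whichever of the tied rows is picked, the result has exactly one row per Mt value, and its count is the group maximum.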
To get the first occurrence of the maximum count in each group, you can use the idxmax() function. It returns index labels rather than positions, so select the rows with df.loc:
>>> df.loc[df.groupby(['Mt'])['count'].idxmax()]
Mt Sp Value count
0 s1 a 1 3
3 s2 d 4 10
5 s3 f 6 6
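The label-versus-position distinction only becomes visible when the index is not the default 0..n-1. The sketch below uses a made-up index (10 through 15, purely hypothetical) to show that idxmax() hands back labels, which is why .loc is the matching selector.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
        'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
        'Value': [1, 2, 3, 4, 5, 6],
        'count': [3, 2, 5, 10, 10, 6],
    },
    index=[10, 11, 12, 13, 14, 15],  # hypothetical non-default labels
)

# One index label per group: the first row that holds the group's max count.
best_labels = df.groupby('Mt')['count'].idxmax()

# .loc selects by label; .iloc would treat 10, 13, 15 as positions and fail here.
result = df.loc[best_labels]
print(result)
```

With ties, idxmax() keeps the first occurrence, so s2 resolves to row d.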
Playing off of Roman Pekar’s answer, I found that the following code works:
from math import isnan
df.iloc[[int(x) for x in df.groupby(by=df.Mt).apply(lambda x: x['count'].idxmax()).values if not isnan(x)]]
Note the isnan condition: my application had some NaN entries in the column being maximized, and a group with no valid value yields NaN instead of an index. Using iloc here relies on the default integer index, where labels and positions coincide.
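An alternative way to handle missing values, offered as a sketch rather than as what the answer above does, is to drop the NaN rows before grouping. A group whose counts are all NaN then simply disappears from the result, and no isnan check is needed. The NaN placed in s3 below is invented for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'count': [3, 2, 5, 10, 10, np.nan],  # s3 has no valid count (made-up data)
})

# Ignore rows whose count is missing, so every remaining group has a real max.
valid = df.dropna(subset=['count'])

# idxmax() returns labels from the original index, so they still work on df.
result = df.loc[valid.groupby('Mt')['count'].idxmax()]
print(result)
```

Here s3 is left out entirely because it has nothing to maximize over.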
The answers already given don’t show clearly what is by far the fastest option.
Sort by the column whose max you want, and then drop duplicates (drop_duplicates takes as a parameter the names of the columns to consider when identifying duplicates). It keeps the first row for each Mt value, which after the descending sort is a row with the max count.
df.sort_values('count', ascending=False).drop_duplicates(['Mt'])
NB: Yes, this answer is already given in a comment on the question, but it is very easy to miss. It can be up to 10 times faster than groupby.
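As a quick sanity check, the sketch below runs the sort-and-drop approach on the question's data and compares the kept counts against the per-group maximum from groupby. The variable names are my own, and the comparison is only there to confirm that the result is correct; it says nothing about speed.

```python
import pandas as pd

df = pd.DataFrame({
    'Sp': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Mt': ['s1', 's1', 's2', 's2', 's2', 's3'],
    'Value': [1, 2, 3, 4, 5, 6],
    'count': [3, 2, 5, 10, 10, 6],
})

# Largest counts first; drop_duplicates then keeps the first row per Mt value.
result = df.sort_values('count', ascending=False).drop_duplicates(subset='Mt')

# Cross-check against the group maxima computed the groupby way.
kept = result.set_index('Mt')['count'].sort_index()
expected = df.groupby('Mt')['count'].max()
print(kept.tolist() == expected.tolist())
```

The rows come back ordered by count rather than by Mt; call sort_index() or sort_values('Mt') afterwards if the original order matters.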