Get the max value from each group with pandas.DataFrame.groupby
Question:
I need to group my dataframe by two columns, count the values of the second column, and then keep only the row with the highest value in the "count" column. Let me show:
df =
col1|col2
---------
A | AX
A | AX
A | AY
A | AY
A | AY
B | BX
B | BX
B | BX
B | BY
B | BY
C | CX
C | CX
C | CX
C | CX
C | CX
------------
df1 = df.groupby(['col1', 'col2']).agg({'col2': 'count'})
df1.columns = ['count']
df1 = df1.reset_index()
out:
col1 col2 count
A AX 2
A AY 3
B BX 3
B BY 2
C CX 5
So far so good, but now I need only the row of each 'col1' group that has the maximum 'count' value, while keeping the value in 'col2'.
Expected output:
col1 col2 count
A AY 3
B BX 3
C CX 5
I have no idea how to do that. My attempts so far at using the max() aggregation have always dropped 'col2'.
Answers:
From your original DataFrame you can use .value_counts, which returns counts in descending order within each group. Given this sorting, drop_duplicates will keep the most frequent value within each group.
df1 = (df.groupby('col1')['col2'].value_counts()
.rename('counts').reset_index()
.drop_duplicates('col1'))
col1 col2 counts
0 A AY 3
2 B BX 3
4 C CX 5
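If the counts have already been computed, as in the question's df1, the same "sort, then drop duplicates" idea can be sketched with an explicit sort instead of relying on value_counts ordering. This is only a minimal sketch: the frame below is typed in by hand from the question's intermediate output, and the name `top` is made up for illustration.

```python
import pandas as pd

# Counts per (col1, col2) pair, copied from the question's intermediate output.
df1 = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'C'],
    'col2': ['AX', 'AY', 'BX', 'BY', 'CX'],
    'count': [2, 3, 3, 2, 5],
})

# Sort so that the largest count comes first inside each col1 group,
# then keep only the first row seen for each col1 value.
top = (df1.sort_values(['col1', 'count'], ascending=[True, False])
          .drop_duplicates('col1')
          .reset_index(drop=True))

print(top)
```

As with the answer above, ties are resolved by keeping whichever tied row happens to come first after sorting.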
I guess you need this: df['qty'] = 1 and then df.groupby(['col1', 'col2']).sum().reset_index()
Probably not ideal, but this works:
df1.loc[df1.groupby(level=0).idxmax()['count']]
col1 col2 count
A AY 3
B BX 3
C CX 5
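This works because the groupby inside loc returns the index labels of the row with the largest count in each group, which loc then looks up. Note that it relies on df1 still being indexed by col1 and col2, i.e. before reset_index() is called.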
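For a flat frame (after reset_index()), a similar idxmax lookup might look like the sketch below. It is a self-contained variant, not the exact code from the answer above: the counts are rebuilt with size(), and the names `best_rows` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': list('AAAAABBBBBCCCCC'),
    'col2': ['AX', 'AX', 'AY', 'AY', 'AY',
             'BX', 'BX', 'BX', 'BY', 'BY',
             'CX', 'CX', 'CX', 'CX', 'CX'],
})

# One row per (col1, col2) pair with its number of occurrences.
df1 = df.groupby(['col1', 'col2']).size().reset_index(name='count')

# For each col1 group, idxmax gives the row label holding the largest count.
best_rows = df1.groupby('col1')['count'].idxmax()

# loc then pulls those rows, keeping col2 alongside the count.
result = df1.loc[best_rows]

print(result)
```

idxmax returns a single label per group, so when two values tie for the maximum only one of them is kept.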
Option 1: Include Ties
Use this if there are ties and you want to show them.
A tie occurs when, for instance, both (B, BX) and (B, BY) appear 3 times.
# Prepare packages
import pandas as pd
# Create dummy data
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'col2': ['AX', 'AX', 'AY', 'AY', 'AY', 'BX', 'BX', 'BX', 'BY', 'BY', 'BY', 'CX', 'CX', 'CX', 'CX', 'CX'],
})
# Get Max Value by Group with Ties
df_count = df.groupby('col1', as_index=False)['col2'].value_counts()
# Keep every row whose count equals the maximum count of its col1 group
m = df_count.groupby('col1')['count'].transform('max') == df_count['count']
df1 = df_count[m]
col1 col2 count
0 A AY 3
2 B BX 3
3 B BY 3
4 C CX 5
Option 2: Short Code Ignoring Ties
df1 = (df
.groupby('col1')['col2']
.value_counts()
.groupby(level=0)
.head(1)
# .to_frame('count').reset_index() # Uncomment to get exact output requested
)
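To see how the two options differ, here is a small sketch that runs both on the tied data from Option 1. It works on the value_counts Series directly rather than on a reset frame, and the variable names (`counts`, `first_only`, `with_ties`) are illustrative.

```python
import pandas as pd

# Same tied data as in Option 1: (B, BX) and (B, BY) both occur 3 times.
df = pd.DataFrame({
    'col1': ['A'] * 5 + ['B'] * 6 + ['C'] * 5,
    'col2': ['AX', 'AX', 'AY', 'AY', 'AY',
             'BX', 'BX', 'BX', 'BY', 'BY', 'BY',
             'CX', 'CX', 'CX', 'CX', 'CX'],
})

# Series of counts indexed by (col1, col2), descending within each col1.
counts = df.groupby('col1')['col2'].value_counts()

# Option 2 style: first row of each group only, so one tied row is dropped.
first_only = counts.groupby(level=0).head(1)

# Option 1 style: every row whose count equals its group's maximum.
with_ties = counts[counts == counts.groupby(level=0).transform('max')]

print(len(first_only))  # 3
print(len(with_ties))   # 4
```

Which of the tied B rows head(1) keeps depends on the order value_counts returns them in, so use the mask when that matters.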