Pass percentiles to pandas agg function
Question:
I want to pass the numpy percentile()
function through pandas’ agg()
function as I do below with various other numpy statistics functions.
Right now I have a dataframe that looks like this:
AGGREGATE MY_COLUMN
A 10
A 12
B 5
B 9
A 84
B 22
And my code looks like this:
grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])
The above code works, but I want to do something like
column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])
I.e., specify various percentiles to return from agg()
.
How should this be done?
Answers:
Perhaps not super efficient, but one way would be to create a function yourself:
def percentile(n):
def percentile_(x):
return np.percentile(x, n)
percentile_.__name__ = 'percentile_%s' % n
return percentile_
Then include this in your agg
:
In [11]: column.agg([np.sum, np.mean, np.std, np.median,
np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
sum mean std median var amin amax percentile_50 percentile_95
AGGREGATE
A 106 35.333333 42.158431 12 1777.333333 10 84 12 76.8
B 36 12.000000 8.888194 9 79.000000 5 22 12 76.8
Note sure this is how it should be done though…
Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question’s notation, aggregating by the percentile 95, should be:
dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))
You can also assign this function to a variable and use it in conjunction with other aggregation functions.
Try this for the 50% and 95% percentile:
column.describe(percentiles=[0.5, 0.95])
Multiple function can be called as below:
import pandas as pd
import numpy as np
import random
C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })
print(df)
d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)
I was unable to call median in this, but able to work other functions.
I really like the solution Andy Hayden gave, however, this had multiple issues for me:
- If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
- For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this, probably that I am using Python 3.
- Need to import numpy as well instead of staying in pandas (I know, numpy is imported implicitely in pandas…)
Here is an updated version that fixes these issues:
def percentile(n):
def percentile_(x):
return x.quantile(n)
percentile_.__name__ = 'percentile_{:2.0f}'.format(n*100)
return percentile_
For situations where all you need is a subset of the describe
(typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.
For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:
columns.agg('describe')[['25%', '50%', '75%', 'count']]
For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.
You can have agg()
use a custom function to be executed on specified column:
# 50th Percentile
def q50(x):
return x.quantile(0.5)
# 90th Percentile
def q90(x):
return x.quantile(0.9)
my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})
I believe the idiomatic way to do this in pandas is:
df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])
by default describe
function give us mean, count, std, min, max
, and with percentiles array you can choose the needed percentiles.
Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:
df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)),
columns=['My_KEY', 'MY_COL1'])
One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:
df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )
However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:
df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)),
columns=['My_KEY', 'MY_COL1', 'MY_COL2'])
# non-anonymous function
def percentil25 (x):
return np.percentile(x, q=25)
# type 1: call for both columns
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ] )
# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
'MY_COL2': np.size})
More efficient solution with pandas.Series.quantile
method:
df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))
With several percentile values
percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
You can also perhaps use lambda to achieve the same. Some thing like below piece of code :
agg(
lambda x: [
np.min(a=x),
np.percentile(q=25,a=x),
np.median(a=x),
np.percentile(q=75,a=x),
np.max(a=x)
]
)
This can provide some customization:
list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=75)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)
In case you have a dataframe with several columns and only want the quantiles for one column:
df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
and in case you want a 1 level dataframe you can add:
df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1]).reset_index()
You can define simple lambda function
my_quant = lambda _q: lambda _data:_data.quantile(_q)
the use it with the required quantiles
df.aggregate([my_quant(0.1),my_quant(0.3),'median',my_quant(0.7)]
** If you want to keep the aggregations name without renaming the indexes you can create lambda creator and set the function names
def lambda_creator(func,name):
func.__name__ = name
return func
my_quant = lambda _q: lambda_creator(lambda _data: _data.quantile(_q),f'quantile_{_q}')
I want to pass the numpy percentile()
function through pandas’ agg()
function as I do below with various other numpy statistics functions.
Right now I have a dataframe that looks like this:
AGGREGATE MY_COLUMN
A 10
A 12
B 5
B 9
A 84
B 22
And my code looks like this:
grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])
The above code works, but I want to do something like
column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])
I.e., specify various percentiles to return from agg()
.
How should this be done?
Perhaps not super efficient, but one way would be to create a function yourself:
def percentile(n):
def percentile_(x):
return np.percentile(x, n)
percentile_.__name__ = 'percentile_%s' % n
return percentile_
Then include this in your agg
:
In [11]: column.agg([np.sum, np.mean, np.std, np.median,
np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
sum mean std median var amin amax percentile_50 percentile_95
AGGREGATE
A 106 35.333333 42.158431 12 1777.333333 10 84 12 76.8
B 36 12.000000 8.888194 9 79.000000 5 22 12 76.8
Note sure this is how it should be done though…
Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question’s notation, aggregating by the percentile 95, should be:
dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))
You can also assign this function to a variable and use it in conjunction with other aggregation functions.
Try this for the 50% and 95% percentile:
column.describe(percentiles=[0.5, 0.95])
Multiple function can be called as below:
import pandas as pd
import numpy as np
import random
C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']
A = [ random.randint(0,100) for i in range(10) ]
B = [ random.randint(0,100) for i in range(10) ]
df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })
print(df)
d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)
I was unable to call median in this, but able to work other functions.
I really like the solution Andy Hayden gave, however, this had multiple issues for me:
- If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
- For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this, probably that I am using Python 3.
- Need to import numpy as well instead of staying in pandas (I know, numpy is imported implicitely in pandas…)
Here is an updated version that fixes these issues:
def percentile(n):
def percentile_(x):
return x.quantile(n)
percentile_.__name__ = 'percentile_{:2.0f}'.format(n*100)
return percentile_
For situations where all you need is a subset of the describe
(typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.
For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:
columns.agg('describe')[['25%', '50%', '75%', 'count']]
For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.
You can have agg()
use a custom function to be executed on specified column:
# 50th Percentile
def q50(x):
return x.quantile(0.5)
# 90th Percentile
def q90(x):
return x.quantile(0.9)
my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})
I believe the idiomatic way to do this in pandas is:
df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])
by default describe
function give us mean, count, std, min, max
, and with percentiles array you can choose the needed percentiles.
Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:
df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)),
columns=['My_KEY', 'MY_COL1'])
One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:
df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )
However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:
df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)),
columns=['My_KEY', 'MY_COL1', 'MY_COL2'])
# non-anonymous function
def percentil25 (x):
return np.percentile(x, q=25)
# type 1: call for both columns
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ] )
# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
'MY_COL2': np.size})
More efficient solution with pandas.Series.quantile
method:
df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))
With several percentile values
percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
You can also perhaps use lambda to achieve the same. Some thing like below piece of code :
agg(
lambda x: [
np.min(a=x),
np.percentile(q=25,a=x),
np.median(a=x),
np.percentile(q=75,a=x),
np.max(a=x)
]
)
This can provide some customization:
list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=75)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)
In case you have a dataframe with several columns and only want the quantiles for one column:
df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
and in case you want a 1 level dataframe you can add:
df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1]).reset_index()
You can define simple lambda function
my_quant = lambda _q: lambda _data:_data.quantile(_q)
the use it with the required quantiles
df.aggregate([my_quant(0.1),my_quant(0.3),'median',my_quant(0.7)]
** If you want to keep the aggregations name without renaming the indexes you can create lambda creator and set the function names
def lambda_creator(func,name):
func.__name__ = name
return func
my_quant = lambda _q: lambda_creator(lambda _data: _data.quantile(_q),f'quantile_{_q}')