Pass percentiles to pandas agg function

Question:

I want to pass the numpy percentile() function through pandas’ agg() function as I do below with various other numpy statistics functions.

Right now I have a dataframe that looks like this:

AGGREGATE   MY_COLUMN
A           10
A           12
B           5
B           9
A           84
B           22

And my code looks like this:

grouped = dataframe.groupby('AGGREGATE')
column = grouped['MY_COLUMN']
column.agg([np.sum, np.mean, np.std, np.median, np.var, np.min, np.max])

The above code works, but I want to do something like

column.agg([np.sum, np.mean, np.percentile(50), np.percentile(95)])

I.e., specify various percentiles to return from agg().

How should this be done?

Asked By: slizb

||

Answers:

Perhaps not super efficient, but one way would be to create a function yourself:

def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

Then include this in your agg:

In [11]: column.agg([np.sum, np.mean, np.std, np.median,
                     np.var, np.min, np.max, percentile(50), percentile(95)])
Out[11]:
           sum       mean        std  median          var  amin  amax  percentile_50  percentile_95
AGGREGATE
A          106  35.333333  42.158431      12  1777.333333    10    84             12           76.8
B           36  12.000000   8.888194       9    79.000000     5    22             12           76.8

Note sure this is how it should be done though…

Answered By: Andy Hayden

Being more specific, if you just want to aggregate your pandas groupby results using the percentile function, the python lambda function offers a pretty neat solution. Using the question’s notation, aggregating by the percentile 95, should be:

dataframe.groupby('AGGREGATE').agg(lambda x: np.percentile(x['COL'], q = 95))

You can also assign this function to a variable and use it in conjunction with other aggregation functions.

Answered By: prl900

Try this for the 50% and 95% percentile:

column.describe(percentiles=[0.5, 0.95])
Answered By: scottlittle

Multiple function can be called as below:

import pandas as pd

import numpy as np

import random

C = ['Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima', 'Ram', 'Ram', 'Shyam', 'Shyam', 'Mahima']

A = [ random.randint(0,100) for i in range(10) ]

B = [ random.randint(0,100) for i in range(10) ]

df = pd.DataFrame({ 'field_A': A, 'field_B': B, 'field_C': C })

print(df)

d = df.groupby('field_C')['field_A'].describe()[['mean', 'count', '25%', '50%', '75%']]
print(d)

I was unable to call median in this, but able to work other functions.

Answered By: Fakira

I really like the solution Andy Hayden gave, however, this had multiple issues for me:

  • If the dataframe has multiple columns, it aggregated over the columns instead of over the rows?
  • For me, the row names were percentile_0.5 (dot instead of underscore). Not sure what caused this, probably that I am using Python 3.
  • Need to import numpy as well instead of staying in pandas (I know, numpy is imported implicitely in pandas…)

Here is an updated version that fixes these issues:

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = 'percentile_{:2.0f}'.format(n*100)
    return percentile_
Answered By: Thomas

For situations where all you need is a subset of the describe (typically the most common needed statistics) you can just index the returned pandas series without needing any extra functions.

For example, I commonly find myself just needing to present the 25th, median, 75th and count. This can be done in just one line like so:

columns.agg('describe')[['25%', '50%', '75%', 'count']]

For specifying your own set of percentiles, the chosen answer is a good choice, but for simple use case, there is no need for extra functions.

Answered By: Maxim

You can have agg() use a custom function to be executed on specified column:

# 50th Percentile
def q50(x):
    return x.quantile(0.5)

# 90th Percentile
def q90(x):
    return x.quantile(0.9)

my_DataFrame.groupby(['AGGREGATE']).agg({'MY_COLUMN': [q50, q90, 'max']})
Answered By: Arun Nalpet

I believe the idiomatic way to do this in pandas is:

df.groupby("AGGREGATE").quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
Answered By: jvans
df.groupby("AGGREGATE").describe(percentiles=[0, 0.25, 0.5, 0.75, 0.95, 1])

by default describe function give us mean, count, std, min, max, and with percentiles array you can choose the needed percentiles.

Answered By: Agredalopez

Just to throw a more general solution into the ring. Assume you have a DF with just one column to group:

df = pd.DataFrame((('A',10),('A',12),('B',5),('B',9),('A',84),('B',22)), 
                    columns=['My_KEY', 'MY_COL1'])

One can aggregate and calcualte basically any descriptive metric with a list of anonymous (lambda) functions like:

df.groupby(['My_KEY']).agg( [np.sum, np.mean, lambda x: np.percentile(x, q=25)] )

However, if you have multiple columns to aggregate, you have to call a non anonymous function or call the columns explicitly:

df = pd.DataFrame((('A',10,3),('A',12,4),('B',5,6),('B',9,3),('A',84,2),('B',22,1)), 
                    columns=['My_KEY', 'MY_COL1', 'MY_COL2'])

# non-anonymous function
def percentil25 (x): 
    return np.percentile(x, q=25)

# type 1: call for both columns 
df.groupby(['My_KEY']).agg( [np.sum, np.mean, percentil25 ]  )

# type 2: call each column separately
df.groupby(['My_KEY']).agg( {'MY_COL1': [np.sum, np.mean, lambda x: np.percentile(x, q=25)],
                             'MY_COL2': np.size})
Answered By: magraf

More efficient solution with pandas.Series.quantile method:

df.groupby("AGGREGATE").agg(("YOUR_COL_NAME", lambda x: x.quantile(0.5))

With several percentile values

percentiles = [0.5, 0.9, 0.99]
quantile_funcs = [(p, lambda x: x.quantile(p)) for p in percentiles]
df.groupby("AGGREGATE").agg(quantile_funcs)
Answered By: Antiez

You can also perhaps use lambda to achieve the same. Some thing like below piece of code :

        agg(
            lambda x: [
                np.min(a=x), 
                np.percentile(q=25,a=x), 
                np.median(a=x), 
                np.percentile(q=75,a=x), 
                np.max(a=x)
    ]
)
Answered By: Ankit Khandelwal

This can provide some customization:

list_statistics = ['count','mean','min',lambda x: np.percentile(x,q=25),'max',lambda x: np.percentile(x,q=75)]
cols_to_rename = {'<lambda_0>':'P25','<lambda_1>':'P75'}
df_out.groupby('Country').agg(list_statistics).rename(columns=cols_to_rename)
Answered By: Jose Rondon

In case you have a dataframe with several columns and only want the quantiles for one column:

df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1])

and in case you want a 1 level dataframe you can add:

df.groupby("AGGREGATE")['MY_COLUMN'].quantile([0, 0.25, 0.5, 0.75, 0.95, 1]).reset_index()
Answered By: PV8

You can define simple lambda function

my_quant = lambda _q: lambda _data:_data.quantile(_q)

the use it with the required quantiles

df.aggregate([my_quant(0.1),my_quant(0.3),'median',my_quant(0.7)]

** If you want to keep the aggregations name without renaming the indexes you can create lambda creator and set the function names

def lambda_creator(func,name):
    func.__name__ = name
    return func

my_quant = lambda _q: lambda_creator(lambda _data: _data.quantile(_q),f'quantile_{_q}')
Answered By: רועי שמש
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.