Pandas: group by index value, then calculate quantile?

Question:

I have a DataFrame indexed on the month column (set using df = df.set_index('month'), in case that’s relevant):

             org_code  ratio_cost   
month
2010-08-01   1847      8.685939     
2010-08-01   1848      7.883951     
2010-08-01   1849      6.798465     
2010-08-01   1850      7.352603     
2010-09-01   1847      8.778501     

I want to add a new column called quantile, which will assign a quantile value to each row, based on the value of its ratio_cost for that month.

So the example above might look like this:

             org_code  ratio_cost   quantile
month
2010-08-01   1847      8.685939     100 
2010-08-01   1848      7.883951     66.6 
2010-08-01   1849      6.798465     0  
2010-08-01   1850      7.352603     33.3
2010-09-01   1847      8.778501     100

How can I do this? I’ve tried this:

df['quantile'] = df.groupby('month')['ratio_cost'].rank(pct=True)

But I get KeyError: 'month'.

UPDATE: I can reproduce the bug.

Here is my CSV file: http://pastebin.com/raw/6xbjvEL0

And here is the code to reproduce the error:

import pandas as pd

df = pd.read_csv('temp.csv')
df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')
df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'])

I’m using Pandas 0.17.1 on OSX.

Asked By: Richard

Answers:

You have to call sort_index() before rank():

import pandas as pd

df = pd.read_csv('http://pastebin.com/raw/6xbjvEL0')

df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')

df = df.sort_index()

df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'].head())

month
2010-08-01    0.2500
2010-08-01    0.6875
2010-08-01    0.6250
2010-08-01    0.9375
2010-08-01    0.7500
Name: percentile, dtype: float64
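
For the five sample rows in the question, the same idea can be sketched without the CSV. This is a minimal sketch, assuming a reasonably recent pandas where groupby(level='month') works on the named index. rank(pct=True) returns values in (0, 1], so the 0 to 100 scale shown in the question is computed here as (rank - 1) / (n - 1) * 100. Treating a month with a single row as 100 is an assumption taken from the question's expected output.

```python
import pandas as pd

# Sample rows from the question, indexed on month.
df = pd.DataFrame(
    {
        "month": pd.to_datetime(
            ["2010-08-01", "2010-08-01", "2010-08-01", "2010-08-01", "2010-09-01"]
        ),
        "org_code": [1847, 1848, 1849, 1850, 1847],
        "ratio_cost": [8.685939, 7.883951, 6.798465, 7.352603, 8.778501],
    }
).set_index("month")

# Group on the index level by name instead of on a column.
grouped = df.groupby(level="month")["ratio_cost"]

# Percentile rank within each month, in (0, 1].
df["percentile"] = grouped.rank(pct=True)

# Rescale to 0-100 so the lowest value in a month gets 0 and the highest 100.
rank = grouped.rank()
size = grouped.transform("count")
# A month with one row gives 0/0 = NaN; fill it with 100 (assumption, see above).
df["quantile"] = ((rank - 1) / (size - 1) * 100).fillna(100.0)

print(df)
```

The percentile column matches what rank(pct=True) gives in the answer above; the quantile column is only a rescaling of the same per-month ranks.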
Answered By: jezrael

A quantile is a cutoff point in the distribution of ratio_cost, here the 95th percentile. Compute that cutoff as q_cutoff with quantile(0.95), then use a boolean mask to keep only the rows whose ratio_cost falls below it.

import pandas as pd

month = ['2010-08-01', '2010-08-01', '2010-08-01', '2010-08-01', '2010-09-01']
org_code = [1847, 1848, 1849, 1850, 1847]
ratio_cost = [8.685939, 7.883951, 6.798465, 7.352603, 8.778501]
df = pd.DataFrame({'month': month, 'org_code': org_code, 'ratio_cost': ratio_cost})

# 95th-percentile cutoff over the whole ratio_cost column
q_cutoff = df['ratio_cost'].quantile(0.95)
# keep only the rows strictly below the cutoff
mask = df['ratio_cost'] < q_cutoff
trimmed_df = df[mask]

print(trimmed_df)
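
The cutoff above is computed over the whole column. Since the question is about per-month values, here is a hedged sketch of the same masking with one cutoff per month. It assumes month is a regular column, as in the DataFrame built above, and uses groupby with transform to broadcast each month's 95th percentile back to its rows.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": ["2010-08-01"] * 4 + ["2010-09-01"],
        "org_code": [1847, 1848, 1849, 1850, 1847],
        "ratio_cost": [8.685939, 7.883951, 6.798465, 7.352603, 8.778501],
    }
)

# One 95th-percentile cutoff per month, repeated for every row of that month.
q_cutoff = df.groupby("month")["ratio_cost"].transform(lambda s: s.quantile(0.95))

# Keep rows strictly below their own month's cutoff.
trimmed_df = df[df["ratio_cost"] < q_cutoff]

print(trimmed_df)
```

Because the comparison is strict, a month with a single row is dropped entirely (its only value equals its own cutoff); use <= instead if such rows should be kept.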
Answered By: Golden Lion