Pandas: group by index value, then calculate quantile?
Question:
I have a DataFrame indexed on the month column (set using df = df.set_index('month'), in case that’s relevant):
            org_code  ratio_cost
month
2010-08-01      1847    8.685939
2010-08-01      1848    7.883951
2010-08-01      1849    6.798465
2010-08-01      1850    7.352603
2010-09-01      1847    8.778501
I want to add a new column called quantile, which will assign a quantile value to each row, based on the value of its ratio_cost for that month.
So the example above might look like this:
            org_code  ratio_cost  quantile
month
2010-08-01      1847    8.685939     100
2010-08-01      1848    7.883951     66.6
2010-08-01      1849    6.798465     0
2010-08-01      1850    7.352603     33.3
2010-09-01      1847    8.778501     100
How can I do this? I’ve tried this:
df['quantile'] = df.groupby('month')['ratio_cost'].rank(pct=True)
But I get KeyError: 'month'.
UPDATE: I can reproduce the bug.
Here is my CSV file: http://pastebin.com/raw/6xbjvEL0
And here is the code to reproduce the error:
import pandas as pd

df = pd.read_csv('temp.csv')
# Interpret the month column as Unix timestamps in seconds
df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')
df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'])
I’m using Pandas 0.17.1 on OSX.
Answers:
You have to call sort_index before rank:
import pandas as pd
df = pd.read_csv('http://pastebin.com/raw/6xbjvEL0')
df.month = pd.to_datetime(df.month, unit='s')
df = df.set_index('month')
df = df.sort_index()
df['percentile'] = df.groupby(df.index)['ratio_cost'].rank(pct=True)
print(df['percentile'].head())
month
2010-08-01 0.2500
2010-08-01 0.6875
2010-08-01 0.6250
2010-08-01 0.9375
2010-08-01 0.7500
Name: percentile, dtype: float64
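Note that rank(pct=True) returns fractions in (0, 1], so the lowest value in a group of four gets 0.25 rather than 0. If you want the 0 to 100 scale shown in the question, one possible sketch is to rescale the plain rank yourself. This rebuilds the five example rows from the question rather than using the CSV, and treating a month with a single row as 100 is an assumption taken from the expected output:

```python
import pandas as pd

# Example rows from the question, indexed on month
df = pd.DataFrame(
    {
        "month": pd.to_datetime(["2010-08-01"] * 4 + ["2010-09-01"]),
        "org_code": [1847, 1848, 1849, 1850, 1847],
        "ratio_cost": [8.685939, 7.883951, 6.798465, 7.352603, 8.778501],
    }
).set_index("month").sort_index()

grouped = df.groupby(level="month")["ratio_cost"]
rank = grouped.rank(method="min")   # 1 = lowest ratio_cost in the month
size = grouped.transform("count")   # number of rows in the month

# (rank - 1) / (n - 1) maps the lowest value to 0 and the highest to 100
scaled = (rank - 1) / (size - 1) * 100

# Assumption: a month with only one row is reported as 100
df["quantile"] = scaled.where(size > 1, 100.0)
print(df)
```

Grouping with level="month" is equivalent to groupby(df.index) here, and on recent pandas versions it also avoids the KeyError from the question.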
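A quantile is a cut point in the distribution of ratio_cost: the 0.95 quantile is the value below which 95% of the observations fall. You can compute that cutoff with quantile(0.95), store it in q_cutoff, and then use a boolean mask to keep only the rows below it. Note that this computes a single cutoff across all months rather than one per month.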
import pandas as pd

month = ['2010-08-01', '2010-08-01', '2010-08-01', '2010-08-01', '2010-09-01']
org_code = [1847, 1848, 1849, 1850, 1847]
ratio_cost = [8.685939, 7.883951, 6.798465, 7.352603, 8.778501]
df = pd.DataFrame({'month': month, 'org_code': org_code, 'ratio_cost': ratio_cost})

# Value below which 95% of all ratio_cost observations fall
q_cutoff = df['ratio_cost'].quantile(0.95)
# Keep only the rows strictly below the cutoff
mask = df['ratio_cost'] < q_cutoff
trimmed_df = df[mask]
print(trimmed_df)
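The cutoff above is taken over the whole column. If you want it per month, which seems closer to what the question asks, a rough sketch is to broadcast each month's 0.95 quantile back onto its rows with transform. The names cutoff and trimmed are just illustrative, and month is kept as an ordinary column here:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": ["2010-08-01"] * 4 + ["2010-09-01"],
        "org_code": [1847, 1848, 1849, 1850, 1847],
        "ratio_cost": [8.685939, 7.883951, 6.798465, 7.352603, 8.778501],
    }
)

# Each row receives the 0.95 quantile of ratio_cost for its own month
cutoff = df.groupby("month")["ratio_cost"].transform(lambda s: s.quantile(0.95))

# Keep rows strictly below their month's cutoff
trimmed = df[df["ratio_cost"] < cutoff]
print(trimmed)
```

With a strict < comparison, a month that has only one row is dropped entirely, because its single value equals its own cutoff. Use <= if you would rather keep it.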