Find percentile stats of a given column
Question:
I have a pandas DataFrame my_df, and I can find the mean(), median(), and mode() of a given column:
my_df['field_A'].mean()
my_df['field_A'].median()
my_df['field_A'].mode()
Is it possible to get more detailed statistics, such as the 90th percentile? Thanks!
Answers:
I figured out that the following works:
my_df.dropna().quantile([0.0, .9])
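As a small check of the snippet above, the sketch below uses a made-up frame (the values and the placement of the NaNs are assumptions for illustration only). It suggests that quantile() already skips missing values column by column, whereas calling dropna() on the whole frame first discards every row that has a NaN in any column, which can change the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column, on different rows.
my_df = pd.DataFrame({
    "field_A": [1.0, 2.0, 3.0, 4.0, np.nan],
    "field_B": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# quantile() ignores NaN within each column, so each column keeps 4 values.
per_column = my_df.quantile([0.0, 0.9])

# dropna() removes rows 1 and 4 entirely, so only 3 complete rows remain.
complete_rows = my_df.dropna().quantile([0.0, 0.9])

print(per_column)
print(complete_rows)
```

If you only care about one column, my_df['field_A'].quantile(0.9) avoids losing rows because of gaps in unrelated columns.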
Assume a Series s:
s = pd.Series(np.arange(100))
Get the quantiles for [.1, .2, .3, .4, .5, .6, .7, .8, .9]:
s.quantile(np.linspace(.1, 1, 9, endpoint=False))
0.1 9.9
0.2 19.8
0.3 29.7
0.4 39.6
0.5 49.5
0.6 59.4
0.7 69.3
0.8 79.2
0.9 89.1
dtype: float64
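Or, to get values that actually occur in the data instead of interpolated ones:
s.quantile(np.linspace(.1, 1, 9, endpoint=False), interpolation='lower')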
0.1 9
0.2 19
0.3 29
0.4 39
0.5 49
0.6 59
0.7 69
0.8 79
0.9 89
dtype: int32
- You can use the pandas.DataFrame.quantile() function.
- If you look at the API for quantile(), you will see it takes an argument for how to do interpolation if you want a quantile that falls between two positions in your data:
  - ‘linear’, ‘lower’, ‘higher’, ‘midpoint’, or ‘nearest’.
  - By default, it performs linear interpolation.
  - These interpolation methods are discussed in the Wikipedia article for percentile.
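To make the interpolation options listed above concrete, here is a minimal sketch on a four-element Series (the data is made up for illustration). The requested quantile deliberately lands between two data points so that the methods disagree.

```python
import pandas as pd

# Hypothetical data, chosen so the arithmetic is easy to follow.
s = pd.Series([10, 20, 30, 40])

# q=0.4 lands at position 0.4 * (4 - 1) = 1.2, i.e. between 20 and 30.
methods = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {m: s.quantile(0.4, interpolation=m) for m in methods}

for name, value in results.items():
    print(f"{name:>8}: {value}")
```

Here 'lower' returns 20, 'higher' returns 30, 'midpoint' returns 25, 'nearest' returns 20, and the default 'linear' returns a value of about 22, one fifth of the way from 20 to 30.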
import pandas as pd
import numpy as np
# sample data
np.random.seed(2023) # for reproducibility
data = {'Category': np.random.choice(['hot', 'cold'], size=(10,)),
'field_A': np.random.randint(0, 100, size=(10,)),
'field_B': np.random.randint(0, 100, size=(10,))}
df = pd.DataFrame(data)
df.field_A.mean() # Same as df['field_A'].mean()
# 51.1
df.field_A.median()
# 50.0
# You can call `quantile(q)` to get the q-th quantile,
# where `q` is a fraction between 0 and 1.
df.field_A.quantile(0.1) # 10th percentile
# 15.6
df.field_A.quantile(0.5) # same as median
# 50.0
df.field_A.quantile(0.9) # 90th percentile
# 88.8
df.groupby('Category').field_A.quantile(0.1)
#Category
#cold 28.8
#hot 8.6
#Name: field_A, dtype: float64
df
Category field_A field_B
0 cold 96 58
1 cold 22 28
2 hot 17 81
3 cold 53 71
4 cold 47 63
5 hot 77 48
6 cold 39 32
7 hot 69 29
8 hot 88 49
9 hot 3 49
You can even pass multiple columns that contain null values and get several quantiles at once (I use the 95th percentile for outlier treatment):
my_df[['field_A','field_B']].dropna().quantile([0.0, .5, .90, .95])
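The answer above does not say what "outlier treatment" means in practice. One common reading is capping values at the 95th percentile, which could be sketched as follows. The data is hypothetical, and using clip() for the capping step is an assumption, not something the answer states.

```python
import pandas as pd

# Hypothetical column with one extreme value at the end.
values = pd.Series([3, 17, 22, 39, 47, 53, 69, 77, 88, 960], name="field_A")

cap = values.quantile(0.95)      # 95th percentile of the column
capped = values.clip(upper=cap)  # anything above the cap is replaced by it

print(cap)
print(capped.max())
```

Only the extreme value is changed; everything at or below the cap is left as it was.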
A very easy and efficient way is to call the describe() function on the particular column:
df['field_A'].describe()
This will give you the count, mean, standard deviation, min, max, median (50%), and the 25th and 75th percentiles.
describe() gives you quartiles by default. If you want other percentiles, you can do something like:
df['YOUR_COLUMN_HERE'].describe(percentiles=[.1, .2, .3, .4, .5, .6 , .7, .8, .9, 1])
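As a quick illustration of the percentiles argument, the sketch below reuses the field_A values from the sample frame printed earlier in this thread and asks describe() for the 10th and 90th percentiles only. The variable names are made up for this example.

```python
import pandas as pd

# field_A values copied from the sample DataFrame shown earlier.
field_A = pd.Series([96, 22, 17, 53, 47, 77, 39, 69, 88, 3], name="field_A")

# Requested percentiles appear as rows labelled '10%' and '90%'.
summary = field_A.describe(percentiles=[0.1, 0.9])
print(summary)

# The '90%' row is the same number quantile(0.9) returns directly.
p90 = field_A.quantile(0.9)
print(p90)
```

describe() is convenient for a quick overview, while quantile() is the more direct choice when you need a single percentile as a number.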