Assigning categories based on within-group percentile
Question:
I have the following dataframe
Group Country GDP
A a ***
A b ***
B a ***
B b ***
I want to assign a category (high, low) to gdp based on its within-group percentile rank, stored in a new column.
This is what I tried:
def c(gr):
    ser = gr['gdp']
    p = np.nanpercentile(ser, 50)
    for i in ser:
        if i > p:
            return "high"
        else:
            return "low"
grouped = df.groupby('Group')
df['perf'] = grouped.apply(c)
The perf column comes back as NaN. What am I doing wrong here?
Answers:
Use quantile with numpy.where and a custom function:
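def c(gr):
    ser = gr['gdp']
    # q=0.5 is the default, so it can be omitted
    p = ser.quantile()
    gr['perf'] = np.where(ser > p, 'high', 'low')
    return gr
# group_keys=False keeps the original row index in recent pandas versions
df = df.groupby('Group', group_keys=False).apply(c)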
This can be simplified with transform:
q = df.groupby('Group')['gdp'].transform('quantile')
df['perf1'] = np.where(df['gdp'] > q, 'high', 'low')
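As for why the original attempt produced NaN: the return inside the loop exits on the first value, so apply yields a single label per group, indexed by the group names. Assigning that Series to a column aligns on the index, and the group names never match the frame's integer row labels. A minimal sketch with made-up numbers (the data and the helper name first_label are assumptions for illustration; the column is selected before apply to keep the example short):

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                   'gdp': [1.0, 3.0, 2.0, 8.0]})

def first_label(ser):
    # Mirrors the original logic: return exits the loop on the first value.
    p = np.nanpercentile(ser, 50)
    for i in ser:
        if i > p:
            return 'high'
        else:
            return 'low'

# One scalar per group, indexed by the group names 'A' and 'B'.
per_group = df.groupby('Group')['gdp'].apply(first_label)
print(per_group.index.tolist())

# Column assignment aligns on the index; 'A'/'B' do not match 0..3.
df['perf'] = per_group
print(df['perf'].isna().all())
```

This is why a row-shaped result (from transform, or from apply returning the whole group) is needed before assigning back to the frame.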
Sample:
import numpy as np
import pandas as pd

np.random.seed(12)
N = 15
L = list('abcd')
df = pd.DataFrame({'Group': np.random.choice(L, N),
                   'gdp': np.random.rand(N)})
df = df.sort_values('Group').reset_index(drop=True)
df.loc[[0,4,5,10,13,14], 'gdp'] = np.nan
#print (df)
def c(gr):
    ser = gr['gdp']
    # q=0.5 is the default, so it can be omitted
    p = ser.quantile()
    gr['perf'] = np.where(ser > p, 'high', 'low')
    return gr
# group_keys=False keeps the original row index in recent pandas versions
df = df.groupby('Group', group_keys=False).apply(c)
q = df.groupby('Group')['gdp'].transform('quantile')
df['perf1'] = np.where(df['gdp'] > q, 'high', 'low')
print (df)
   Group       gdp  perf perf1
0      a       NaN   low   low
1      a  0.907267  high  high
2      a  0.456051   low   low
3      b  0.675998   low   low
4      b       NaN   low   low
5      b       NaN   low   low
6      b  0.563141   low   low
7      b  0.801265  high  high
8      c  0.372834   low   low
9      c  0.481530  high  high
10     c       NaN   low   low
11     d  0.082524   low   low
12     d  0.725954  high  high
13     d       NaN   low   low
14     d       NaN   low   low
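Note that rows with a missing gdp end up as 'low' in the output above, because any comparison with NaN evaluates to False. If missing values should stay missing instead, one possible sketch (the data and the column name perf2 are made up for illustration) masks the labels afterwards:

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({'Group': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'gdp': [np.nan, 0.9, 0.4, 0.6, np.nan, 0.8]})

# Per-row group median; quantile skips NaN values.
q = df.groupby('Group')['gdp'].transform('quantile')

# Label every row, then blank out the rows where gdp itself is missing.
labels = pd.Series(np.where(df['gdp'] > q, 'high', 'low'), index=df.index)
df['perf2'] = labels.where(df['gdp'].notna())
print(df)
```

Whether NaN rows belong in 'low' or should remain unlabeled depends on the analysis; the original answer leaves them as 'low'.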
An alternative one-liner, in a style similar to R (here with the 0.75 quantile as the threshold):
df['output']=df.groupby('Group').gdp.apply(lambda x : np.where(x>x.quantile(0.75),'High','Low')).apply(pd.Series).stack().dropna().values
df
Out[333]:
   Group       gdp output
0      a       NaN    Low
1      a  0.772128    Low
2      a  0.070406    Low
3      a  0.859301   High
4      a       NaN    Low
5      a       NaN    Low
6      b  0.681299   High
7      b  0.040839    Low
8      c  0.896475   High
9      c  0.726527    Low
10     c       NaN    Low
11     c  0.244783    Low
12     c  0.563001    Low
13     c       NaN    Low
14     d       NaN    Low
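The one-liner above assigns through .values, so it appears to depend on the rows already being sorted by Group. A sketch of the same 0.75 threshold using transform, which keeps the result aligned to the original index regardless of row order (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({'Group': ['a', 'a', 'a', 'a', 'b', 'b'],
                   'gdp': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]})

# Broadcast each group's 0.75 quantile back to its rows.
q75 = df.groupby('Group')['gdp'].transform(lambda s: s.quantile(0.75))
df['output'] = np.where(df['gdp'] > q75, 'High', 'Low')
print(df)
```

With the default linear interpolation, group a has a threshold of 3.25 and group b a threshold of 17.5, so only the largest value in each group is labeled 'High'.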