Create an indicator column from a threshold on another column's values across rows, with groupby in pandas
Question:
I have the following dataframe
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'phase': ['pre', 'post', 'pre', 'post', 'pre', 'post'],
                   'n': [5, 6, 7, 3, 10, 10]})
I want to create a new column (new_col) which indicates whether n >= 5 for both pre and post, by id.
The output dataframe looks like this
pd.DataFrame({'id':[1,1,2,2,3,3],
'phase': ['pre', 'post','pre', 'post','pre', 'post'],
'n': [5,6,7,3,10,10],
'new_col':[1,1,0,0,1,1]})
I would like to avoid any solution using pd.pivot_table.
How could I do that?
Answers:
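You can test whether both values, pre and post, remain in each group. First replace the phase values of rows where n < 5 with missing values using Series.where, then aggregate each group to a set and compare it with {'pre', 'post'}. Finally, map the resulting boolean Series back onto id to build the final column: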
m = df['n'].ge(5)
s = df['phase'].where(m).groupby(df['id']).agg(set) == set({'pre','post'})
df['new_col'] = df['id'].map(s).astype(int)
Alternatively, if the phase column contains only the values pre and post, test whether the number of unique values remaining after where equals 2:
m = df['n'].ge(5)
s = df['phase'].where(m).groupby(df['id']).nunique().eq(2)
df['new_col'] = df['id'].map(s).astype(int)
print(df)
   id phase   n  new_col
0   1   pre   5        1
1   1  post   6        1
2   2   pre   7        0
3   2  post   3        0
4   3   pre  10        1
5   3  post  10        1
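To see why the where-based approaches work, it can help to look at the intermediate objects. The following is a minimal, self-contained sketch on the question's data; the names masked and counts are introduced here for illustration only and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'phase': ['pre', 'post', 'pre', 'post', 'pre', 'post'],
                   'n': [5, 6, 7, 3, 10, 10]})

# True where the threshold is met
m = df['n'].ge(5)

# Keep the phase label only where n >= 5; all other rows become NaN
masked = df['phase'].where(m)

# nunique ignores NaN by default, so this counts the phases that passed per id
counts = masked.groupby(df['id']).nunique()

# An id qualifies only if both phases passed the threshold
df['new_col'] = df['id'].map(counts.eq(2)).astype(int)

print(counts.to_dict())        # {1: 2, 2: 1, 3: 2}
print(df['new_col'].tolist())  # [1, 1, 0, 0, 1, 1]
```

For id 2 the post row has n = 3, so its phase label is masked out and only one unique phase remains.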
Another idea is to test the two phases separately and intersect the matching ids with numpy.intersect1d:
import numpy as np
m = df['n'].ge(5)
s1 = df.loc[m & df['phase'].eq('pre'), 'id']
s2 = df.loc[m & df['phase'].eq('post'), 'id']
df['new_col'] = df['id'].isin(np.intersect1d(s1, s2)).astype(int)
print(df)
   id phase   n  new_col
0   1   pre   5        1
1   1  post   6        1
2   2   pre   7        0
3   2  post   3        0
4   3   pre  10        1
5   3  post  10        1
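With groupby and transform, checking the condition over all rows of each id (this matches the question when every id has both a pre and a post row):
# wrap in parentheses so the chained call can span several lines
df['new_col'] = (df.groupby('id', sort=False)['n']
                   .transform(lambda x: (x >= 5).all())
                   .astype(int))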
   id phase   n  new_col
0   1   pre   5        1
1   1  post   6        1
2   2   pre   7        0
3   2  post   3        0
4   3   pre  10        1
5   3  post  10        1
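The all-based check and the phase-counting check agree on the question's data, but they can diverge when an id is missing one of the phases. A small sketch, assuming a hypothetical extra id 4 (not in the question's data) that has only a pre row, and using the illustrative column names by_nunique and by_all:

```python
import pandas as pd

# The question's data plus a hypothetical id 4 with a 'pre' row only
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3, 4],
                   'phase': ['pre', 'post', 'pre', 'post', 'pre', 'post', 'pre'],
                   'n': [5, 6, 7, 3, 10, 10, 8]})

m = df['n'].ge(5)

# Approach A: both phases must be present with n >= 5
both = df['phase'].where(m).groupby(df['id']).nunique().eq(2)
df['by_nunique'] = df['id'].map(both).astype(int)

# Approach B: every row of the id must have n >= 5
df['by_all'] = (df.groupby('id', sort=False)['n']
                  .transform(lambda x: (x >= 5).all())
                  .astype(int))

print(df[['id', 'by_nunique', 'by_all']].drop_duplicates())
```

Here id 4 gets 0 from the nunique version, because post is absent, and 1 from the all version, because its single row passes the threshold. Which behavior is wanted depends on how incomplete ids should be treated.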