Create indicator column threshold of values of another column for different rows with groupby in pandas

Question

I have the following dataframe

import pandas as pd
df = pd.DataFrame({'id':[1,1,2,2,3,3],
                   'phase': ['pre', 'post','pre', 'post','pre', 'post'],
                   'n': [5,6,7,3,10,10]})

I want to create a new column (new_col) that indicates, for each id, whether n >= 5 for both the pre and the post row.

The output dataframe looks like this

pd.DataFrame({'id':[1,1,2,2,3,3],
             'phase': ['pre', 'post','pre', 'post','pre', 'post'],
             'n': [5,6,7,3,10,10],
             'new_col':[1,1,0,0,1,1]})

I would like to avoid any solution that uses pd.pivot_table.

How could I do that?

Asked By: quant

Answer 1

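You can test whether both pre and post remain in each group after the rows that fail the condition have their phase replaced by missing values with Series.where. Aggregate the remaining values into a set per id, compare each set with {'pre', 'post'}, and finally map the resulting boolean Series back onto id to build the new column: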

m = df['n'].ge(5)
s = df['phase'].where(m).groupby(df['id']).agg(set) == set({'pre','post'})
df['new_col'] =  df['id'].map(s).astype(int)

Or, if only pre/post values exist in the phase column, it is enough to test that the number of unique values left after where equals 2:

m = df['n'].ge(5)
s = df['phase'].where(m).groupby(df['id']).nunique().eq(2)
df['new_col'] =  df['id'].map(s).astype(int)

print (df)
   id phase   n  new_col
0   1   pre   5        1
1   1  post   6        1
2   2   pre   7        0
3   2  post   3        0
4   3   pre  10        1
5   3  post  10        1
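To make the intermediate steps of the nunique variant easier to follow, here is a minimal, self-contained sketch. The names kept and counts are only illustrative and are not part of the original answer; it assumes the phase column contains nothing but 'pre' and 'post'.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'phase': ['pre', 'post', 'pre', 'post', 'pre', 'post'],
                   'n': [5, 6, 7, 3, 10, 10]})

# True where the row meets the threshold
m = df['n'].ge(5)

# Keep the phase label only on passing rows; failing rows become NaN
kept = df['phase'].where(m)

# nunique() skips NaN by default, so this counts the distinct
# passing phases per id
counts = kept.groupby(df['id']).nunique()

# An id qualifies only when both 'pre' and 'post' survived the mask
df['new_col'] = df['id'].map(counts.eq(2)).astype(int)
print(df)
```

Here id 2 keeps only its 'pre' label after the mask, so its count is 1 and both of its rows get 0.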

Another idea is to test the pre and post rows separately and combine the matching ids with numpy.intersect1d:

import numpy as np

m = df['n'].ge(5)
s1 = df.loc[m & df['phase'].eq('pre'), 'id']
s2 = df.loc[m & df['phase'].eq('post'), 'id']

df['new_col'] = df['id'].isin(np.intersect1d(s1, s2)).astype(int)
print (df)
   id phase   n  new_col
0   1   pre   5        1
1   1  post   6        1
2   2   pre   7        0
3   2  post   3        0
4   3   pre  10        1
5   3  post  10        1
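As a rough illustration of what numpy.intersect1d contributes here, the sketch below rebuilds the sample frame and keeps the intermediate result. The names pre_ids, post_ids and both are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
                   'phase': ['pre', 'post', 'pre', 'post', 'pre', 'post'],
                   'n': [5, 6, 7, 3, 10, 10]})

m = df['n'].ge(5)

# ids whose 'pre' row passes the threshold, and likewise for 'post'
pre_ids = df.loc[m & df['phase'].eq('pre'), 'id']
post_ids = df.loc[m & df['phase'].eq('post'), 'id']

# Sorted unique ids present in both selections
both = np.intersect1d(pre_ids, post_ids)

# Flag every row whose id passed in both phases
df['new_col'] = df['id'].isin(both).astype(int)
```

With the sample data, id 2 appears in pre_ids but not in post_ids, so it drops out of the intersection.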

Answered By: jezrael

Answer 2

Use groupby with transform to evaluate the condition per id:

df['new_col'] = (df.groupby('id', sort=False)['n']
                   .transform(lambda x: int((x >= 5).all())))

   id phase   n  new_col
0   1   pre   5        1
1   1  post   6        1
2   2   pre   7        0
3   2  post   3        0
4   3   pre  10        1
5   3  post  10        1
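Note that this approach checks that every row of an id passes the threshold, which matches the pre/post requirement only when each id has both a pre and a post row. The sketch below is an illustration under that caveat: it adds a hypothetical id 4 with only a pre row (not in the original data) and uses transform('min') as a lambda-free equivalent of the all-rows test.

```python
import pandas as pd

# id 4 is a made-up extra row with no 'post' measurement
df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 3, 4],
                   'phase': ['pre', 'post', 'pre', 'post', 'pre', 'post', 'pre'],
                   'n': [5, 6, 7, 3, 10, 10, 8]})

# Every row of a group is >= 5 exactly when the group minimum is >= 5
df['all_rows'] = df.groupby('id')['n'].transform('min').ge(5).astype(int)

# Explicit pre/post check (assumes phase holds only 'pre' and 'post')
m = df['n'].ge(5)
both = df['phase'].where(m).groupby(df['id']).nunique().eq(2)
df['both_phases'] = df['id'].map(both).astype(int)

print(df)
```

For id 4 the all-rows test gives 1 while the explicit pre/post test gives 0, so which one to use depends on whether ids with a missing phase can occur.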

Answered By: RomanPerekhrest
