creating conditional flag basis the multiple columns

Question:

Existing Dataframe :

Id       Month        Year       scheduled         completed
A         Jan         2021           0                 0
A         Feb         2021           1                 0
A         mar         2021           0                 0
B         June        2021           0                 1
B         July        2021           0                 1
B         Aug         2021           0                 1
B         Sep         2021           0                 1
C         Nov         2021           1                 0
C         Dec         2021           1                 0
C         Jan         2022           1                 0
C         Feb         2022           1                 0

Expected Dataframe :

 Id           status
 A          defaulter
 B          non_defaulter
 C          defaulter

I am trying to create a status tag for each basis their activity. if for three consecutive Month if completed column remains 0 , that Id is to be tagged as "defaulter" else "non_defaulter"

Asked By: Roshankumar

||

Answers:

You can use a double groupby to count the consecutive 1s in completed, then to ensure there is at least 1 stretch greater or equal to N=3:

N = 3

# is the row a zero?
m = df['completed'].eq(0)

# count the consecutive zeros
(m.groupby([df['Id'], (~m).cumsum()])
   .sum().ge(N)
   # check if there is at least one stretch of value >= N
   .groupby(level=0).any()
   # convert the True/False into strings
   .map({False: 'non_defaulter', True: 'defaulter'})
   .reset_index(name='status')
)

Output:

  Id         status
0  A      defaulter
1  B  non_defaulter
2  C      defaulter
Answered By: mozway

Idea is aggregate per consecutive groups of 0 values per Id and helper Series s created by Series.cumsum, count Trues values, then because possible multiple groups per Id aggregate max and last set values in numpy.where:

N = 3
m = df['completed'].ne(0)
#if need check in completed OR scheduled column
#m = df[['completed','scheduled']].ne(0).any(axis=1)

df = ((~m).groupby([df['Id'], m.cumsum().mask(m, -1)])
          .sum()
          .groupby(level=0)
          .max()
          .reset_index(name='status')
          .assign(status = lambda x: np.where(x['status'].ge(N), 
                                            'defaulter','non_defaulter')))
       
print (df)
  Id         status
0  A      defaulter
1  B  non_defaulter
2  C      defaulter
Answered By: jezrael
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.