How to make col2 the cumsum of col1 == 'x', starting from the second occurrence, ordered by datetime, per group?

Question:

I would like a column in a pandas dataframe that

  • counts the number of times 'outcome2' is observed in 'value' through 'datetime'
  • starting from the second observation of 'outcome2'
  • per 'ID' or df.index
import pandas as pd
from io import StringIO
import datetime 

txt= """
ID,datetime,value
A,12/10/2022 10:00:00,outcome1
A,12/10/2022 11:15:10,outcome2
A,14/10/2022 15:30:30,outcome1
B,11/10/2022 11:30:22,outcome1
B,15/10/2022 22:44:11,outcome2
B,15/10/2022 23:30:22,outcome3
B,15/10/2022 23:31:11,outcome2
"""

df = (pd.read_csv(StringIO(txt),
                  parse_dates=[1],
                  dayfirst=True)
      .assign(id_index=lambda x_df: x_df
              .groupby('ID', sort=False).ngroup())
      .set_index("id_index")
      .rename_axis(index=None))

df = df.assign(value_test = lambda df: df['value']=='outcome2',
               value_cumsum= lambda df: df.groupby('ID', sort=False)['value_test'].cumsum())

  ID            datetime     value value_test   value_cumsum
0  A 2022-10-12 10:00:00  outcome1      False              0
0  A 2022-10-12 11:15:10  outcome2       True              1
0  A 2022-10-14 15:30:30  outcome1      False              1
1  B 2022-10-11 11:30:22  outcome1      False              0
1  B 2022-10-15 22:44:11  outcome2       True              1
1  B 2022-10-15 23:30:22  outcome3      False              1
1  B 2022-10-15 23:31:11  outcome2       True              2

I tried assigning a third variable to df using if-statements in the lambda functions. It failed in a way others have experienced (edit: the second attempt further below works, but it is not neat):

df = df.assign(value_test = lambda df: df['value']=='outcome2',
               value_cumsum = lambda df: df.groupby('ID', sort=False)['value_test'].cumsum(),
               outcome2 = lambda df: 0 if df[df['value_cumsum']==1] or df[df['value_cumsum']==0]
                                     else df['value_cumsum']-1)

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
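
This error arises because Python's `if` and `or` call `bool()` on the whole Series/DataFrame, which pandas refuses for objects with more than one element. A minimal sketch of the failure mode, and the vectorized alternative pandas expects:

```python
import pandas as pd

s = pd.Series([True, False])

# Python's `if` calls bool() on the whole Series -> ambiguous for >1 element
try:
    if s:
        pass
except ValueError as e:
    print(e)

# Element-wise conditionals need vectorized methods instead, e.g. Series.where
print(s.where(s, False).tolist())  # [True, False]
```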

# edit:
df = df.assign(value_test = lambda df: df['value']=='outcome2',
               cumsum = lambda df: df.groupby('ID', sort=False)
                        ['value_test'].cumsum(),
               outcome2 = lambda df:df['cumsum'].apply(
                          lambda cumsum: 0 if cumsum == 1
                                 else (0 if cumsum == 0
                                 else (cumsum-1 if cumsum > 1
                                 else 'NaN'))))
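
The row-wise `apply` above boils down to "cumsum - 1 when cumsum > 1, else 0", so the same column can be built fully vectorized with `numpy.where` (a sketch, not part of the original question; the toy frame below is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the value_test column per group
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'value_test': [False, True, False, False, True, False, True]})

cumsum = df.groupby('ID', sort=False)['value_test'].cumsum()
df['outcome2'] = np.where(cumsum > 1, cumsum - 1, 0)
print(df['outcome2'].tolist())  # [0, 0, 0, 0, 0, 0, 1]
```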

I need only the accumulated sum (running total) of counts of 'outcome2' in 'value', starting from the second observation of 'outcome2', per group.

Any suggestions, please?

And is it possible to solve this without the intermediate steps creating value_test or value_cumsum?

Desired df:

  ID            datetime     value outcome2
0  A 2022-10-12 10:00:00  outcome1        0
0  A 2022-10-12 11:15:10  outcome2        0
0  A 2022-10-14 15:30:30  outcome1        0
1  B 2022-10-11 11:30:22  outcome1        0
1  B 2022-10-15 22:44:11  outcome2        0
1  B 2022-10-15 23:30:22  outcome3        0
1  B 2022-10-15 23:31:11  outcome2        1
Asked By: Johan_A


Answers:

You can use:

df['value_cumsum'] = (df.groupby('ID')['value_test']
                      .cumsum().sub(1).where(df['value_test'], 0)
                      )

Or, if you also want the False rows to carry the running total:

df['value_cumsum'] = (df.groupby('ID')['value_test']
                      .cumsum().sub(1).clip(lower=0)
                      )

output:

  ID            datetime     value  value_test  value_cumsum
0  A 2022-10-12 10:00:00  outcome1       False             0
0  A 2022-10-12 11:15:10  outcome2        True             0
0  A 2022-10-14 15:30:30  outcome1       False             0
1  B 2022-10-11 11:30:22  outcome1       False             0
1  B 2022-10-15 22:44:11  outcome2        True             0
1  B 2022-10-15 23:30:22  outcome3       False             0
1  B 2022-10-15 23:31:11  outcome2        True             1
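
The two variants agree on this data but diverge once a non-'outcome2' row follows two or more 'outcome2' rows: `where` resets False rows to 0, while `clip` lets them keep the running total. A small sketch with a hypothetical extra row showing the difference:

```python
import pandas as pd

# Hypothetical group: two outcome2 rows, then an outcome1 row, then another outcome2
df = pd.DataFrame({'ID': ['B'] * 4,
                   'value': ['outcome2', 'outcome2', 'outcome1', 'outcome2']})
test = df['value'].eq('outcome2')
cs = test.groupby(df['ID']).cumsum().sub(1)

print(cs.where(test, 0).tolist())  # [0, 1, 0, 2] -> False row reset to 0
print(cs.clip(lower=0).tolist())   # [0, 1, 1, 2] -> False row keeps running total
```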

without intermediate:

df['value_cumsum'] = (df['value'].eq('outcome2')
                      .groupby(df['ID'])
                      .cumsum().sub(1).clip(lower=0)
                     )

output:

  ID            datetime     value  value_cumsum
0  A 2022-10-12 10:00:00  outcome1             0
0  A 2022-10-12 11:15:10  outcome2             0
0  A 2022-10-14 15:30:30  outcome1             0
1  B 2022-10-11 11:30:22  outcome1             0
1  B 2022-10-15 22:44:11  outcome2             0
1  B 2022-10-15 23:30:22  outcome3             0
1  B 2022-10-15 23:31:11  outcome2             1
Answered By: mozway