How to make col2 a cumsum that starts counting from the second col1 == 'x', in datetime order per group?
Question:
I would like a column in a pandas dataframe that:
- counts the number of times 'outcome2' is observed in 'value' through 'datetime'
- starting from the second observation of 'outcome2'
- per 'ID' or df.index
import pandas as pd
from io import StringIO

txt = """
ID,datetime,value
A,12/10/2022 10:00:00,outcome1
A,12/10/2022 11:15:10,outcome2
A,14/10/2022 15:30:30,outcome1
B,11/10/2022 11:30:22,outcome1
B,15/10/2022 22:44:11,outcome2
B,15/10/2022 23:30:22,outcome3
B,15/10/2022 23:31:11,outcome2
"""
df = (pd.read_csv(StringIO(txt),
                  parse_dates=[1],
                  dayfirst=True)
      .assign(id_index=lambda x_df: x_df.groupby('ID', sort=False).ngroup())
      .set_index("id_index")
      .rename_axis(index=None))
df = df.assign(value_test=lambda df: df['value'] == 'outcome2',
               value_cumsum=lambda df: df.groupby('ID', sort=False)['value_test'].cumsum())
ID datetime value value_test value_cumsum
0 A 2022-10-12 10:00:00 outcome1 False 0
0 A 2022-10-12 11:15:10 outcome2 True 1
0 A 2022-10-14 15:30:30 outcome1 False 1
1 B 2022-10-11 11:30:22 outcome1 False 0
1 B 2022-10-15 22:44:11 outcome2 True 1
1 B 2022-10-15 23:30:22 outcome3 False 1
1 B 2022-10-15 23:31:11 outcome2 True 2
I tried assigning a third column to df using if-statements inside the lambda functions. It failed in a way others have experienced. Edit: the second attempt below works, but it is not neat:
df = df.assign(value_test=lambda df: df['value'] == 'outcome2',
               value_cumsum=lambda df: df.groupby('ID', sort=False)['value_test'].cumsum(),
               outcome2=lambda df: 0 if df[df['value_cumsum'] == 1] or df[df['value_cumsum'] == 0]
                                   else df['value_cumsum'] - 1)
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
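The ValueError above comes from putting a boolean Series/DataFrame in a plain Python `if`, which has no single truth value; element-wise branching needs a vectorized construct instead. A minimal sketch (not the author's code) reproducing the error and the usual `numpy.where` fix:

```python
import pandas as pd
import numpy as np

s = pd.Series([0, 1, 2], name="value_cumsum")

# A plain `if` on a Series raises: there are many booleans, not one.
try:
    if s == 1:  # ambiguous: which element should decide?
        pass
except ValueError as e:
    print(e)

# Element-wise if/else is spelled np.where(condition, if_true, if_false):
out = np.where(s > 1, s - 1, 0)
print(out)
```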
# edit:
df = df.assign(value_test=lambda df: df['value'] == 'outcome2',
               cumsum=lambda df: df.groupby('ID', sort=False)['value_test'].cumsum(),
               outcome2=lambda df: df['cumsum'].apply(
                   # cumsum of booleans is never negative, so two branches suffice
                   lambda cumsum: 0 if cumsum <= 1 else cumsum - 1))
I need only the accumulated sum (running total) of counts of 'outcome2' in 'value', starting from the second observation of 'outcome2', per group.
Any suggestions, please? And is it possible to solve it without the intermediate steps of creating value_test or value_cumsum?
Desired df:
ID datetime value outcome2
0 A 2022-10-12 10:00:00 outcome1 0
0 A 2022-10-12 11:15:10 outcome2 0
0 A 2022-10-14 15:30:30 outcome1 0
1 B 2022-10-11 11:30:22 outcome1 0
1 B 2022-10-15 22:44:11 outcome2 0
1 B 2022-10-15 23:30:22 outcome3 0
1 B 2022-10-15 23:31:11 outcome2 1
Answers:
You can use:
df['value_cumsum'] = (df.groupby('ID')['value_test']
                        .cumsum().sub(1).where(df['value_test'], 0)
                      )
Or, if you want the running total carried onto the False rows as well:
df['value_cumsum'] = (df.groupby('ID')['value_test']
                        .cumsum().sub(1).clip(lower=0)
                      )
output:
ID datetime value value_test value_cumsum
0 A 2022-10-12 10:00:00 outcome1 False 0
0 A 2022-10-12 11:15:10 outcome2 True 0
0 A 2022-10-14 15:30:30 outcome1 False 0
1 B 2022-10-11 11:30:22 outcome1 False 0
1 B 2022-10-15 22:44:11 outcome2 True 0
1 B 2022-10-15 23:30:22 outcome3 False 0
1 B 2022-10-15 23:31:11 outcome2 True 1
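The two variants only diverge on non-'outcome2' rows that come after a group's second 'outcome2': `.where(df['value_test'], 0)` resets those rows to 0, while `.clip(lower=0)` carries the running total over. A small sketch on a hypothetical mini-group (not in the question's data) shows the difference:

```python
import pandas as pd

# Hypothetical group: second 'outcome2' is followed by another value
df = pd.DataFrame({"ID": ["B"] * 4,
                   "value": ["outcome2", "outcome3", "outcome2", "outcome1"]})

test = df["value"].eq("outcome2")
cs = test.groupby(df["ID"]).cumsum().sub(1)

print(cs.where(test, 0).tolist())  # False rows forced back to 0
print(cs.clip(lower=0).tolist())   # False rows keep the running total
```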
Without the intermediate columns:
df['value_cumsum'] = (df['value'].eq('outcome2')
                        .groupby(df['ID'])
                        .cumsum().sub(1).clip(lower=0)
                      )
output:
ID datetime value value_cumsum
0 A 2022-10-12 10:00:00 outcome1 0
0 A 2022-10-12 11:15:10 outcome2 0
0 A 2022-10-14 15:30:30 outcome1 0
1 B 2022-10-11 11:30:22 outcome1 0
1 B 2022-10-15 22:44:11 outcome2 0
1 B 2022-10-15 23:30:22 outcome3 0
1 B 2022-10-15 23:31:11 outcome2 1