Counting number of events up until a specific date in a pandas dataframe

Question:

I would like to count the number of events that occured up until a date that is specified in the row.
I want to know how many events took place per ID up until the specified date.

Original dataframe:

Date ID Event
23.01.23 1 AA
19.01.23 1 AB
23.12.22 1 AA
23.01.23 2 AA
19.01.23 2 AA
23.12.22 2 AB

Expected result:

Date ID Event Count of AA Count of AB
23.01.23 1 AA 2 1
19.01.23 1 AB 1 1
23.12.22 1 AA 1 0
23.01.23 2 AA 2 1
19.01.23 2 AA 1 1
23.12.22 2 AB 0 1

I was thinking to do a groupby, then pivot the table but the results were incorrect. Then I tried to do it SQL style and I almost have it but it’s not yet what I need.

d = {'Date': ["23.01.23", "19.01.23", "23.12.22", "23.01.23", "19.01.23", "23.12.22"],'ID': [1,1,1,2,2,2], "Event": ["AA","AB","AA","AA","AA","AB"]}

test_df = pd.DataFrame(data = d)

import duckdb

duckdb.query("SELECT Date, Id, Event, COUNT() OVER(PARTITION BY ID, event ) as 'count' FROM test_df").df()

outputs:

enter image description here

The SQL query is missing a distinct within the partition by but I didn’t manage to make it work.

Any type of solution (pandas, python, SQL) is greatly appreciated.

Asked By: ZuZiTeK

||

Answers:

You can use crosstab+groupby.cumsum and merge:

# ensure datetime
test_df['Date'] = pd.to_datetime(test_df['Date'])

# compute crosstab, cumsum per group and merge
test_df.merge((pd.crosstab([test_df['Date'], test_df['ID']], test_df['Event'])
                 .groupby(level='ID').cumsum()
                ), left_on=['Date', 'ID'], right_index=True
             )

Output:

        Date  ID Event  AA  AB
0 2023-01-23   1    AA   2   1
1 2023-01-19   1    AB   1   1
2 2022-12-23   1    AA   1   0
3 2023-01-23   2    AA   2   1
4 2023-01-19   2    AA   1   1
5 2022-12-23   2    AB   0   1
Answered By: mozway
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.