Frequency rolling count with groupby, Pandas

Question:

I’m trying to get the frequency count of a groupby which is grouped by name and date. I am having trouble figuring out how to do a 3 days roll count prior to the current day.

example: on 2022-01-05, John’s 3 days range are 2022-01-05 and 2022-01-01 with 2 times each, hence, showing 4 in the ‘3 days roll count’ column. And ‘3 day completion count’ column shows that John complete the work 1 time when working on 2022-01-05 and 2022-01-01. Kindly scroll down ‘result trying to get’ for df sample.

for clarification, 3 days range is a rolling range counting between 1 to 3, hence, the example shows 2 days. And the data starts on 2022-01-01, therefore, any dates before 2022-01-01 are reflects as 0. Lastly, on 2022-01-08, John’s 3 days roll count equals to 4 is because the 3 days are 3rd, 5th, and 7th.

sample df

╔════╦════════════╦══════╦════════════╗
║    ║  date_id   ║ name ║ completion ║
╠════╬════════════╬══════╬════════════╣
║  0 ║ 2022-01-01 ║ John ║ Y          ║
║  1 ║ 2022-01-01 ║ John ║ N          ║
║  2 ║ 2022-01-02 ║ Sam  ║ N          ║
║  3 ║ 2022-01-02 ║ Sam  ║ N          ║
║  4 ║ 2022-01-03 ║ John ║ N          ║
║  5 ║ 2022-01-03 ║ John ║ N          ║
║  6 ║ 2022-01-03 ║ May  ║ Y          ║
║  7 ║ 2022-01-04 ║ Sam  ║ Y          ║
║  8 ║ 2022-01-04 ║ Sam  ║ Y          ║
║  9 ║ 2022-01-05 ║ John ║ Y          ║
║ 10 ║ 2022-01-05 ║ May  ║ N          ║
║ 11 ║ 2022-01-05 ║ May  ║ Y          ║
║ 12 ║ 2022-01-06 ║ Sam  ║ N          ║
║ 13 ║ 2022-01-07 ║ John ║ Y          ║
║ 14 ║ 2022-01-08 ║ John ║ N          ║
╚════╩════════════╩══════╩════════════╝

result trying to get

╔════╦════════════╦══════╦════════════╦═══════════════════╦════════════════════════╗
║    ║  date_id   ║ name ║ completion ║ 3 days roll count ║ 3 day completion count ║
╠════╬════════════╬══════╬════════════╬═══════════════════╬════════════════════════╣
║  0 ║ 2022-01-01 ║ John ║ Y          ║                 0 ║                      0 ║
║  1 ║ 2022-01-01 ║ John ║ N          ║                 0 ║                      0 ║
║  2 ║ 2022-01-02 ║ Sam  ║ N          ║                 0 ║                      0 ║
║  3 ║ 2022-01-02 ║ Sam  ║ N          ║                 0 ║                      0 ║
║  4 ║ 2022-01-03 ║ John ║ N          ║                 2 ║                      1 ║
║  5 ║ 2022-01-03 ║ John ║ N          ║                 2 ║                      1 ║
║  6 ║ 2022-01-03 ║ May  ║ Y          ║                 0 ║                      0 ║
║  7 ║ 2022-01-04 ║ Sam  ║ Y          ║                 2 ║                      0 ║
║  8 ║ 2022-01-04 ║ Sam  ║ Y          ║                 2 ║                      0 ║
║  9 ║ 2022-01-05 ║ John ║ Y          ║                 4 ║                      1 ║
║ 10 ║ 2022-01-05 ║ May  ║ N          ║                 1 ║                      1 ║
║ 11 ║ 2022-01-05 ║ May  ║ Y          ║                 1 ║                      1 ║
║ 12 ║ 2022-01-06 ║ Sam  ║ N          ║                 4 ║                      2 ║
║ 13 ║ 2022-01-07 ║ John ║ Y          ║                 5 ║                      2 ║
║ 14 ║ 2022-01-08 ║ John ║ N          ║                 4 ║                      2 ║
╚════╩════════════╩══════╩════════════╩═══════════════════╩════════════════════════╝

code for sample df

    'date_id': [
'2022-01-01','2022-01-01','2022-01-02','2022-01-02','2022-01-03',
'2022-01-03','2022-01-03','2022-01-04','2022-01-04','2022-01-05',
'2022-01-05','2022-01-05','2022-01-06','2022-01-07','2022-01-08',
],
     
    'name': [
'John','John','Sam','Sam','John',
'John','May','Sam','Sam','John',
'May','May','Sam','John','John',
],
    
    'completion': [
'Y','N','N','N','N',
'N','Y','Y','Y','Y',
'N','Y','N','Y','N'
],
          
})

ps. i've tried using apply,lambda,shift with pd.to_timedelta found here to shift the dates but wasn't able to get the count right. 

kindly advise
Asked By: ManOnTheMoon

||

Answers:

Solution

# is work completed?
df['is_complete'] = df['completion'] == 'Y'

# groupby name and date and aggregate to calculate how many 
# times a work is attempted and how many times its actually completed
df1 = df.groupby(['name', 'date_id'])['is_complete'].agg(['count', 'sum'])

# Calculate a 4 days rolling sum per `name` and subtract the 
# current value to get the effective 3 days rolling sum
df1 = df1.groupby(level=0).rolling(4, min_periods=1).sum().droplevel(1) - df1
df1.columns = ['3 days roll count', '3 day completion count']

# Merge the result with original dataframe
df1 = df.merge(df1.reset_index(), how='left')

Result

       date_id  name completion  is_complete  3 days roll count   3 day completion count
0   2022-01-01  John          Y         True                0.0                      0.0
1   2022-01-01  John          N        False                0.0                      0.0
2   2022-01-02   Sam          N        False                0.0                      0.0
3   2022-01-02   Sam          N        False                0.0                      0.0
4   2022-01-03  John          N        False                2.0                      1.0
5   2022-01-03  John          N        False                2.0                      1.0
6   2022-01-03   May          Y         True                0.0                      0.0
7   2022-01-04   Sam          Y         True                2.0                      0.0
8   2022-01-04   Sam          Y         True                2.0                      0.0
9   2022-01-05  John          Y         True                4.0                      1.0
10  2022-01-05   May          N        False                1.0                      1.0
11  2022-01-05   May          Y         True                1.0                      1.0
12  2022-01-06   Sam          N        False                4.0                      2.0
13  2022-01-07  John          Y         True                5.0                      2.0
14  2022-01-08  John          N        False                4.0                      2.0
Answered By: Shubham Sharma
Categories: questions Tags: , , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.