Grouping consecutively timed rows into sessions in pandas

Question:

I have logged user data from a video platform and need to group consecutively watched videos into sessions. The data looks like this:

           user_id          pid           view_time
724117  2755025951  84142247679 2022-09-17 12:31:35
724051  2755025951  84049359330 2022-09-17 12:35:01
724179  2755025951  84206528723 2022-09-17 12:37:43
723968  2755025951  83893305120 2022-09-17 12:45:04
724000  2755025951  83963063552 2022-09-17 12:49:15

For simplicity, let’s say I want to group all rows watched within a 5-minute interval (i.e. less than 5 minutes after the previous video). The resulting data would then count the number of videos in each session and give the start time of that session, like this:

           user_id  session_start_time   number_of_videos
724117  2755025951 2022-09-17 12:31:35                  3
723445  2755025951 2022-09-17 12:45:04                  2

I have tried iterating through the dataframe using iterrows() and comparing the timestamp of each row with the following row, but the best I can get is two rows grouped into one session (the trouble is recursively checking an indefinite number of subsequent rows to see whether they are within 5 minutes, rather than checking them against a single fixed row). I have also tried using shift (as per this answer: Access next, previous, or current row in pandas .loc[] assignment), but this runs into the same issue.

Any help would be appreciated!

Asked By: salamander


Answers:

Create groups based on the view_time column:

import pandas as pd

# Ensure view_time is a datetime (if necessary)
df['view_time'] = pd.to_datetime(df['view_time'])

# A gap of more than 5 minutes since the previous view starts a new session
view_grp = df.groupby('user_id')['view_time'].diff().gt('5min').cumsum()

out = (df.groupby(['user_id', view_grp], as_index=False)
         .agg(session_start_time=('view_time', 'first'),
              number_of_videos=('view_time', 'size')))

Output:

>>> out
      user_id  session_start_time  number_of_videos
0  2755025951 2022-09-17 12:31:35                 3
1  2755025951 2022-09-17 12:45:04                 2

Details:

diff = df.groupby('user_id')['view_time'].diff().rename('diff')
group = diff.gt('5min').cumsum().rename('group')
pd.concat([df, diff, group], axis=1)

# Output
           user_id          pid           view_time            diff  group
724117  2755025951  84142247679 2022-09-17 12:31:35             NaT      0
724051  2755025951  84049359330 2022-09-17 12:35:01 0 days 00:03:26      0
724179  2755025951  84206528723 2022-09-17 12:37:43 0 days 00:02:42      0
723968  2755025951  83893305120 2022-09-17 12:45:04 0 days 00:07:21      1
724000  2755025951  83963063552 2022-09-17 12:49:15 0 days 00:04:11      1
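The approach above can be verified end to end on a reconstruction of the sample data from the question (the original frame has a non-default index, which does not affect the logic):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'user_id': [2755025951] * 5,
    'pid': [84142247679, 84049359330, 84206528723, 83893305120, 83963063552],
    'view_time': pd.to_datetime([
        '2022-09-17 12:31:35', '2022-09-17 12:35:01', '2022-09-17 12:37:43',
        '2022-09-17 12:45:04', '2022-09-17 12:49:15',
    ]),
})

# A gap of more than 5 minutes since the previous view starts a new session;
# diff() yields NaT for the first row, which compares False and so lands in group 0
view_grp = df.groupby('user_id')['view_time'].diff().gt('5min').cumsum()

out = (df.groupby(['user_id', view_grp], as_index=False)
         .agg(session_start_time=('view_time', 'first'),
              number_of_videos=('view_time', 'size')))
print(out)
```

The 12:45:04 row arrives 7 minutes 21 seconds after the previous one, so it opens the second session, giving sessions of 3 and 2 videos respectively.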
Answered By: Corralien