Grouping consecutively timed rows into sessions with pandas
Question:
I have logged user data from a video platform and need to group consecutively watched videos into sessions. The data looks like this:
user_id pid view_time
724117 2755025951 84142247679 2022-09-17 12:31:35
724051 2755025951 84049359330 2022-09-17 12:35:01
724179 2755025951 84206528723 2022-09-17 12:37:43
723968 2755025951 83893305120 2022-09-17 12:45:04
724000 2755025951 83963063552 2022-09-17 12:49:15
For simplicity, let’s say I want to group all rows watched within a 5-minute interval (i.e. less than 5 minutes after the previous video). Then, the resulting data would count the number of videos in each session and give the start time of that session, like this:
user_id session_start_time number_of_videos
724117 2755025951 2022-09-17 12:31:35 3
723445 2755025951 2022-09-17 12:45:04 2
I have tried iterating through the dataframe using iterrows()
and comparing the timestamp of each row with the following row, but the best I can manage is grouping two rows into one session (the trouble is iterating over an indefinite number of subsequent rows to check whether they fall within 5 minutes, rather than comparing each one against a fixed row). I have also tried using shift()
(as per this answer: Access next, previous, or current row in pandas .loc[] assignment), but that runs into the same issue.
Any help would be appreciated!
Answers:
Create groups based on the view_time column:
# Convert to datetime if necessary
df['view_time'] = pd.to_datetime(df['view_time'])

# A gap of more than 5 minutes starts a new session;
# cumsum turns the boolean flags into running session ids
view_grp = df.groupby('user_id')['view_time'].diff().gt('5min').cumsum()

out = (df.groupby(['user_id', view_grp], as_index=False)
         .agg(session_start_time=('view_time', 'first'),
              number_of_videos=('view_time', 'size')))
Output:
>>> out
user_id session_start_time number_of_videos
0 2755025951 2022-09-17 12:31:35 3
1 2755025951 2022-09-17 12:45:04 2
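For reference, here is a self-contained sketch of the approach, rebuilt from the sample data in the question (the column names and values are taken from there; the DataFrame construction is assumed):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'user_id': [2755025951] * 5,
    'pid': [84142247679, 84049359330, 84206528723,
            83893305120, 83963063552],
    'view_time': pd.to_datetime([
        '2022-09-17 12:31:35', '2022-09-17 12:35:01',
        '2022-09-17 12:37:43', '2022-09-17 12:45:04',
        '2022-09-17 12:49:15',
    ]),
})

# Gaps over 5 minutes mark session boundaries; cumsum assigns session ids
view_grp = df.groupby('user_id')['view_time'].diff().gt('5min').cumsum()

out = (df.groupby(['user_id', view_grp], as_index=False)
         .agg(session_start_time=('view_time', 'first'),
              number_of_videos=('view_time', 'size')))
print(out)
```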
Details:
# Inspect the intermediate time gaps and session ids
diff = df.groupby('user_id')['view_time'].diff().rename('diff')
group = diff.gt('5min').cumsum().rename('group')
pd.concat([df, diff, group], axis=1)
# Output
user_id pid view_time diff group
724117 2755025951 84142247679 2022-09-17 12:31:35 NaT 0
724051 2755025951 84049359330 2022-09-17 12:35:01 0 days 00:03:26 0
724179 2755025951 84206528723 2022-09-17 12:37:43 0 days 00:02:42 0
723968 2755025951 83893305120 2022-09-17 12:45:04 0 days 00:07:21 1
724000 2755025951 83963063552 2022-09-17 12:49:15 0 days 00:04:11 1
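A common follow-up (not asked in the question, but a small extension of the same idea) is to also report when each session ended and how long it lasted. The session key is built exactly as above; the extra column names below are illustrative:

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    'user_id': [2755025951] * 5,
    'pid': [84142247679, 84049359330, 84206528723,
            83893305120, 83963063552],
    'view_time': pd.to_datetime([
        '2022-09-17 12:31:35', '2022-09-17 12:35:01',
        '2022-09-17 12:37:43', '2022-09-17 12:45:04',
        '2022-09-17 12:49:15',
    ]),
})

# Same session key as before, renamed to avoid clashing with the column name
session = (df.groupby('user_id')['view_time']
             .diff().gt('5min').cumsum().rename('session'))

sessions = (df.groupby(['user_id', session], as_index=False)
              .agg(session_start_time=('view_time', 'first'),
                   session_end_time=('view_time', 'last'),
                   number_of_videos=('view_time', 'size')))

# Elapsed time from the first to the last view in each session
sessions['duration'] = (sessions['session_end_time']
                        - sessions['session_start_time'])
```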