Aggregating Pandas DataFrame rows based on multiple criteria (userID, end date = start date, etc.)

Question:

I’ve read through a lot of answers on how to aggregate rows in
a pandas dataframe, but I’ve had a hard time figuring out how to apply
them to my case. I have a dataframe containing trip data for vehicles,
so each vehicle can make several trips within a given day. Here’s an
example below:

vehicleID  start pos time             end pos time               duration (seconds)  meters travelled
XXXXX      2021-10-26 06:01:12+00:00  2021-10-26 06:25:06+00:00  1434                2000
XXXXX      2021-10-19 13:49:09+00:00  2021-10-19 13:59:29+00:00  620                 5000
XXXXX      2021-10-19 13:20:36+00:00  2021-10-19 13:26:40+00:00  364                 70000
YYYYY      2022-09-10 15:14:07+00:00  2022-09-10 15:29:39+00:00  932                 8000
YYYYY      2022-08-28 15:16:35+00:00  2022-08-28 15:28:43+00:00  728                 90000

It often happens that the start time of a trip, on a given day, is
only a few minutes after the end time of the previous trip, which
means that these can be chained into a single trip.

I would like to aggregate the rows so that if a trip’s start pos time
overlaps with the previous trip’s end pos time, or the gap between the
two is less than 30 minutes, the trips become a single row, summing the
duration in seconds and the meters travelled, obviously per vehicleID.
The new df should also contain the trips that didn’t require aggregation (edited for clarity). So this is the output I’m trying to get:

vehicleID  start pos time             end pos time               duration (seconds)  meters travelled
XXXXX      2021-10-26 06:01:12+00:00  2021-10-26 06:25:06+00:00  1434                2000
XXXXX      2021-10-19 13:20:36+00:00  2021-10-19 13:59:29+00:00  984                 75000
YYYYY      2022-09-10 15:14:07+00:00  2022-09-10 15:29:39+00:00  932                 8000
YYYYY      2022-08-28 15:16:35+00:00  2022-08-28 15:28:43+00:00  728                 90000

I feel like a groupby and an agg would be involved, but I have no
clue how to go about this. Any help would be appreciated! Thanks!
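The 30-minute chaining rule can be illustrated on two of the XXXXX rows above (a minimal sketch; the timestamps are taken from the sample table):

```python
import pandas as pd

# end of the earlier XXXXX trip and start of the later one (from the table)
end_prev = pd.Timestamp("2021-10-19 13:26:40+00:00")
start_next = pd.Timestamp("2021-10-19 13:49:09+00:00")

# trips chain if the later trip starts within 30 minutes of the earlier end
gap = start_next - end_prev
print(gap, gap < pd.Timedelta(minutes=30))  # 0 days 00:22:29 True
```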

Asked By: Adorable


Answers:

There is probably a more efficient way to code this up, but something like this should work (the new_df has what you’re looking for):

NOTE: the code below assumes the start and end pos time columns have been converted to datetime format (e.g. with pd.to_datetime)


import pandas as pd

df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
                      4: 'YYYYY'},
        'start pos time': {0: '2021-10-26 06:01:12+00:00',
                           1: '2021-10-19 13:49:09+00:00',
                           2: '2021-10-19 13:20:36+00:00',
                           3: '2022-09-10 15:14:07+00:00',
                           4: '2022-08-28 15:16:35+00:00'},
        'end pos time': {0: '2021-10-26 06:25:06+00:00',
                         1: '2021-10-19 13:59:29+00:00',
                         2: '2021-10-19 13:26:40+00:00',
                         3: '2022-09-10 15:29:39+00:00',
                         4: '2022-08-28 15:28:43+00:00'},
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
        })

# convert the time columns from strings to datetimes
df['start pos time'] = pd.to_datetime(df['start pos time'])
df['end pos time'] = pd.to_datetime(df['end pos time'])

# sort dataframe by ID and then start time of trip
df = df.sort_values(by=['vehicleID', 'start pos time'])

# create a new column with the end time of the previous ride
df.loc[:, 'prev end'] = df['end pos time'].shift(1)

# create a new column with the difference between the start time of the current trip and the end time of the prior one
df.loc[:, 'diff'] = df.loc[:, 'start pos time'] - df.loc[:, 'prev end']


# convert difference column to seconds
df.loc[:, 'diff'] = df['diff'].dt.total_seconds()

# where vehicle IDs are the same and the difference between the start time of the current trip and end time of the
# prior trip is less than or equal to 30 minutes, change the start time of the current trip to the start time of the 
# prior one
df.loc[((df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)), 'start pos time'] = df['start pos time'].shift(1)

# create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time':'max',
                                                                          'duration (seconds)':'sum',
                                                                          'meters travelled':'sum'})

EDIT: if there may be >2 trips that need to be aggregated (as @ouroboros1 pointed out) you can replace everything after the "convert difference column to seconds" code with:

# [based on @ouroboros1 solution] where vehicle IDs are the same and the difference between the start time of the current
# trip and end time of the prior trip is less than or equal to 30 minutes, put trips in the same "group"
df.loc[:, 'group'] = ((df['vehicleID'] != df['vehicleID'].shift(1)) | (df['diff'] > 30*60)).cumsum()

# create a new dataframe, grouped by vehicle ID and group, using the minimum start time and maximum end time for each group
new_df = df.groupby(['vehicleID', 'group'], as_index=False).agg({'start pos time':'min',
                                                                 'end pos time':'max',
                                                                 'duration (seconds)':'sum',
                                                                 'meters travelled':'sum'})
Answered By: jpk

I believe I have found a solution.

Setup

import pandas as pd
from datetime import timedelta

data = {'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY', 
                      4: 'YYYYY'}, 
        'start pos time': {0: '2021-10-26 06:01:12+00:00', 
                           1: '2021-10-19 13:49:09+00:00', 
                           2: '2021-10-19 13:20:36+00:00', 
                           3: '2022-09-10 15:14:07+00:00', 
                           4: '2022-08-28 15:16:35+00:00'}, 
        'end pos time': {0: '2021-10-26 06:25:06+00:00', 
                         1: '2021-10-19 13:59:29+00:00', 
                         2: '2021-10-19 13:26:40+00:00', 
                         3: '2022-09-10 15:29:39+00:00', 
                         4: '2022-08-28 15:28:43+00:00'}, 
        'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728}, 
        'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
        }

df = pd.DataFrame(data)

Assumptions:

  • All groups (unique values) in col vehicleID are in consecutive order.
  • Per group in col vehicleID the associated timestamps in col start pos time are sorted in descending order.

Problem

Within each group in col vehicleID, if a start pos time is earlier than the end pos time of the previous trip (i.e. the next row, since times are sorted descending), or less than 30 minutes later, then these rows should become a single row, with min for start pos time, max for end pos time, and sum for both duration and meters travelled.


Solution

# if still needed, change date time strings into timestamps
df[['start pos time', 'end pos time']] = df[
    ['start pos time', 'end pos time']].apply(pd.to_datetime)

# flag rows where (end time + 29m59s) < start time of the previous row
# (the later trip, since times are sorted descending): the gap exceeds 30 min
cond1 = (df.loc[:, 'end pos time'] + timedelta(minutes=29, seconds=59)).lt(
    df.loc[:, 'start pos time'].shift(1))

# check `vehicleID` != its own shift (this means a new group is starting)
# i.e. a new group should always get `True`
cond2 = (df.loc[:, 'vehicleID'] != df.loc[:, 'vehicleID'].shift(1))

# cumulative sum of (cond1 OR cond2) labels each run of chained trips
cond = (cond1 | cond2).cumsum()

# apply groupby on ['vehicleID' & cond] and aggregate appropriate functions
# (adding vehicleID is now unnecessary, but this keeps the col in the data)
res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time':'min',
     'end pos time':'max',
     'duration (seconds)':'sum',
     'meters travelled':'sum'}
    )

print(res)

  vehicleID            start pos time              end pos time  
0     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00   
1     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00   
2     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00   
3     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00   

   duration (seconds)  meters travelled  
0                1434              2000  
1                 984             75000  
2                 932              8000  
3                 728             90000  

I’ve run a check: the solution also works when more than two consecutive trips stay within the defined range.
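As a quick check of that claim, here is a small sketch with three chained trips for a hypothetical vehicle ZZZZZ (made-up timestamps, same column layout and grouping logic as above):

```python
import pandas as pd
from datetime import timedelta

# three trips for one vehicle, sorted descending by start time;
# each trip starts within 30 minutes of the end of the trip below it
df = pd.DataFrame({
    'vehicleID': ['ZZZZZ', 'ZZZZZ', 'ZZZZZ'],
    'start pos time': pd.to_datetime([
        '2021-10-19 14:20:00+00:00',
        '2021-10-19 13:49:09+00:00',
        '2021-10-19 13:20:36+00:00']),
    'end pos time': pd.to_datetime([
        '2021-10-19 14:35:00+00:00',
        '2021-10-19 13:59:29+00:00',
        '2021-10-19 13:26:40+00:00']),
    'duration (seconds)': [900, 620, 364],
    'meters travelled': [1000, 5000, 70000],
})

# same grouping logic as above
cond1 = (df['end pos time'] + timedelta(minutes=29, seconds=59)).lt(
    df['start pos time'].shift(1))
cond2 = df['vehicleID'] != df['vehicleID'].shift(1)
cond = (cond1 | cond2).cumsum()

res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time': 'min', 'end pos time': 'max',
     'duration (seconds)': 'sum', 'meters travelled': 'sum'})
print(res)  # all three trips collapse into a single row
```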


Update:
In the answer by @BeRT2me, the values for duration (seconds) from all original rows that merge into a new row aren’t summed, rather the duration is recalculated based on the new start and end time. This makes a lot of sense. If you want to do this with my method, just adjust the last part of the code as follows:

# cut out `duration` here:
res = df.groupby(['vehicleID', cond], as_index=False).agg(
    {'start pos time':'min',
     'end pos time':'max',
     # 'duration (seconds)':'sum',
     'meters travelled':'sum'}
    )

# and recalculate the duration
res['duration (seconds)'] = res['end pos time'].sub(
    res['start pos time']).dt.total_seconds()
Answered By: ouroboros1
Note: this answer assumes the columns have been renamed with underscores (start_pos_time, etc.) and the time columns converted to datetimes.

def func(d):
    mask = d.start_pos_time.sub(d.end_pos_time.shift(-1)).lt('30m')
    d.loc[mask, 'start_pos_time'] = d.start_pos_time.shift(-1)
    d = d.groupby('start_pos_time', as_index=False).agg({'end_pos_time': 'max', 'meters_travelled': 'sum'})
    return d

df = df.groupby('vehicleID').apply(func).reset_index('vehicleID').reset_index(drop=True)

df['duration_(seconds)'] = (df.end_pos_time - df.start_pos_time).dt.total_seconds()
print(df)

Output:

  vehicleID            start_pos_time              end_pos_time  meters_travelled  duration_(seconds)
0     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00             75000              2333.0
1     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00              2000              1434.0
2     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00             90000               728.0
3     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00              8000               932.0
Answered By: BeRT2me