How can I calculate a date differential in Python across multiple rows and columns?

Question:

I’m trying to calculate the differential between the first Sent date/time in an ID and the last Received date/time in an ID, grouping them by Source and Destination. Sample (named test_subset) looks like this (but it is ‘000s of rows):

| ID   | From | To   | leg_sent | leg_received | Source | Destination |
|:-----|:-----|:-----|:---------|:-------------|:-------|:------------|
| 1btr | ABC  | XYZ  | 08:22:23 | 08:22:41     | GB     | FR          |
| 1btr | XYZ  | DEF  | 08:22:49 | 08:23:05     | GB     | FR          |
| 2vyu | LMN  | JFK  | 14:35:11 | 14:35:23     | US     | DE          |
| 2vyu | JFK  | HIJ  | 14:35:35 | 14:35:48     | US     | DE          |
| 2vyu | HIJ  | TPQ  | 14:35:51 | 14:36:25     | US     | DE          |

In the data ID is a transfer ID with the From and To rows being each of the legs of the transfer. In some instances their may be only 2 legs in others there may be as many as 10; what has to be calculated is the difference between first sent and last received, which won’t be on the same row. The Source and Destination would remain the same across all rows relating to the same ID.

I currently have a piece of code (below) that works for me to calculate the differential between leg_sent and leg_received on the same row and grouping it by the From and To codes:

pairs = (
test_subset
.withColumn('"leg_time"', F.datediff('seconds', F.col('"leg_sent"'), F.col('"leg_received"')))
.groupBy(['"From"', '"To"'])
.agg(
    F.count('*').alias('"num_paths"'),
    F.avg('"leg_time"').alias('"average_leg_time"'),
    F.min('"leg_time"').alias('"min_leg_time"'),
    F.max('"leg_time"').alias('"max_leg_time"')
)
.withColumn('"leg_range"', F.col('"max_leg_time"') - F.col('"min_leg_time"'))
)

I need to calculate the full time differential by calculating the difference between first leg_sent and last leg_received for the same ID and grouping by Source and Destination. Ideally I want to modify the same code block above to suit this.

Thanks!

Asked By: Qaribbean

||

Answers:

If I understand correctly to get the full time for the same id, source and destination, I think you can group by ID, Source and Destination and get the MIN(leg_sent) and MAX(leg_received) and then calculate the difference

pairs = (
test_subset
.groupBy(['"ID"', '"Source"', '"Destination"'])
.agg(
    F.min('"leg_sent"').alias('"min_leg_sent"'),
    F.max('"leg_received"').alias('"max_leg_received"')
)
.withColumn('"leg_full_time"', F.datediff('seconds', F.col('"min_leg_sent"'), F.col('"max_leg_received"')))
)
Answered By: Islam Elbanna