Pandas: calculate time difference between different milestones in column

Question:

I have a table like this:

id tm milestone
00335c06f96a21e4089c49a5da 2023-02-01 18:13:42.307543 A
00335c06f96a21e4089c49a5da 2023-02-01 18:14:42.307543 A
00335c06f96a21e4089c49a5da 2023-02-01 18:15:42.307543 A
00335c06f96a21e4089c49a5da 2023-02-01 18:19:10.307543 B
00335c06f96a21e4089c49a5da 2023-02-01 18:21:05.307543 C
0043545f6b9112c7e471d5cc81 2023-02-02 08:06:42.307543 A
0043545f6b9112c7e471d5cc81 2023-02-02 08:07:42.307543 A
0043545f6b9112c7e471d5cc81 2023-02-02 09:05:42.307543 B
0043545f6b9112c7e471d5cc81 2023-02-02 09:05:42.307543 B
ffe92ae6b0962e800dbdbdf00a 2023-02-12 19:05:42.307543 A
ffe92ae6b0962e800dbdbdf00a 2023-02-12 19:05:42.307543 B
ffe92ae6b0962e800dbdbdf00a 2023-02-12 19:07:42.307543 B
ffe92ae6b0962e800dbdbdf00a 2023-02-13 21:03:42.307543 C

What I want to achieve – is convert it to transactional view / pivot so that for each id to have columns with max time difference between milestones (A, B, C, D..). For starting – columns A-B, A-C.
Treaky part that there are multiple records with same milestone – I am taking min time for each category. But also not all id have full sequence, in this case I need to mark this time difference as None.

For Example:

id A-B, min A-C, min
00335c06f96a21e4089c49a5da 5.47 7.38
0043545f6b9112c7e471d5cc81 59.0 None
ffe92ae6b0962e800dbdbdf00a 0 1558.0

What I’m trying to do:
Take min values for each milestone:
df.groupby([‘id’,’milestone ‘])[‘tm’].min().to_frame(‘tm’).reset_index(),

And apply .diff() like .sort_values(['id','tm']).groupby('id')['tm'].diff() But can’t feagure out how to properly do this to create time-difference columns.
Other idea is to use .pivot but can’t come up with idea how to aggregate milestones inside.

Appreciate any help.

Input DataFrame for example:

pd.DataFrame({'id':['00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a'],
             'tm':['2023-02-01 18:13:42.307543','2023-02-01 18:14:42.307543','2023-02-01 18:15:42.307543','2023-02-01 18:19:10.307543','2023-02-01 18:21:05.307543',
                  '2023-02-02 08:06:42.307543','2023-02-02 08:07:42.307543','2023-02-02 09:05:42.307543','2023-02-02 09:05:42.307543',
                  '2023-02-12 19:05:42.307543','2023-02-12 19:05:42.307543','2023-02-12 19:07:42.307543','2023-02-13 21:03:42.307543'],
              'milestone':['A','A','A','B','C','A','A','B','B','A','B','B','C']})
Asked By: Alex_Y

||

Answers:

Update

I am taking min time for each category

Just change the aggregation function of pivot_table:

df['tm'] = pd.to_datetime(df['tm'])  # maybe unnecessary

m = df['milestone'] == 'A'  # first milestone or not

out = (pd.merge(df[m], df[~m], on='id')
         .assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
                 tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() / 60)
         .pivot_table(index='id', columns='milestone', values='tm', aggfunc='first')
         .round(2).rename_axis(columns=None).reset_index())

Output:

>>> out
                           id    A-B      A-C
0  00335c06f96a21e4089c49a5da   5.47     7.38
1  0043545f6b9112c7e471d5cc81  59.00      NaN
2  ffe92ae6b0962e800dbdbdf00a   0.00  1558.00

IIUC, you want to compare time between the first milestone (A) and other ones, so you can use merge before reshaping your dataframe with pivot_table:

df['tm'] = pd.to_datetime(df['tm'])  # maybe unnecessary

m = df['milestone'] == 'A'  # first milestone or not

out = (pd.merge(df[m], df[~m], on='id')
         .assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
                 tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() // 60)
         .pivot_table(index='id', columns='milestone', values='tm', aggfunc=min)
         .rename_axis(columns=None).reset_index())

Output:

>>> out
                           id   A-B     A-C
0  00335c06f96a21e4089c49a5da   3.0     5.0
1  0043545f6b9112c7e471d5cc81  58.0     NaN
2  ffe92ae6b0962e800dbdbdf00a   0.0  1558.0
Answered By: Corralien
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.