How to use Pandas diff() with other columns value as period?

Question:

I have a dataframe looking like this:

Timestamp description
0 Parser starts
12 parsing
24 parsing
26 Parsing finished
28 Parser starts
45 Parsing finished

I want to calculate the how long each parse took. I therefore want the difference between timestamps where (df['description'] == 'Parsing finished') and (df['description'] == 'Parser starts'). I know I can use pd.diff() but I can only find how to use it with a set period. I want to set the period based on the description value.

Expected output:

Timestamp description difference
0 Parser starts NaN
12 parsing NaN
24 parsing NaN
26 Parsing finished 26
28 Parser starts NaN
45 Parsing finished 17

I thought of looping over each row but this seems counterintuitive when using Pandas.

EDIT: updated wrong value thanks to comment of @mozway. Made myself more clear with below table:

Timestamp description
0 Parser starts
12 parsing
24 parsing
26 Parsing finished
27 Uploading results
28 Parser starts
45 Parsing finished

I do not want the timestamp of uploading results (or other values in between parser starts and parsing finished) to be part of the diff. Therefore grouping on parser starts does not provide the result Im looking for. I only want the diff between parser starts and parsing finished.

Asked By: Damiaan

||

Answers:

You can use a groupby:

import numpy as np

# make groups starting with "Parser starts"
group = df['description'].eq('Parser starts').cumsum()

# set up the grouper
g = df.groupby(group)

# update last value with ptp (= max - min)
df.loc[g.cumcount(ascending=False).eq(0),
       'difference'] = g['Timestamp'].transform(np.ptp)

output:

   Timestamp       description  difference
0          0     Parser starts         NaN
1         12           parsing         NaN
2         24           parsing         NaN
3         26  Parsing finished        26.0
4         28     Parser starts         NaN
5         45  Parsing finished        17.0

with filter

m1 = df['description'].eq('Parser starts')
m2 = df['description'].eq('Parsing finished')

g = df['Timestamp'].where(m1|m2).groupby(m1.cumsum())
df.loc[g.cumcount(ascending=False).eq(0),
       'difference'] = g.transform(lambda g: g.max()-g.min())
Answered By: mozway
def function1(dd:pd.DataFrame):
    dd.loc[dd.index.max(),'difference']=dd.Timestamp.max()-dd.Timestamp.min()
    return dd

df1.assign(col1=df1.description.eq('Parser starts').cumsum()).groupby('col1').apply(function1)

out:

  Timestamp       description  col1  difference
0          0     Parser starts     1         NaN
1         12           parsing     1         NaN
2         24           parsing     1         NaN
3         26  Parsing finished     1        26.0
4         28     Parser starts     2         NaN
5         45  Parsing finished     2        17.0
Answered By: G.G
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.