How to get duration of a condition (in index length) in a pandas column vectorized

Question:

I have a dataset with time-series data. When a condition is met for a parameter, I want to measure how long it stays met.
I can loop over all the positions where the condition changes, but that seems inefficient.

What is the best way to do this vectorized?

Example:

import numpy as np
import pandas as pd

np.random.seed(0)

# generate dataset:
df = pd.DataFrame({'condition': np.random.randint(0, 2, 24)}, 
                  index = pd.date_range(start='2020', freq='M', periods=24))

df

Goal:
My goal is to create a column with the duration of each run of consecutive '1' values in this example, placed on the last '1' of the run.

What I did so far:

# find start and end of condition:
ends = df[df.condition.diff() < 0].index
start = df[df.condition.diff() > 0].index[:ends.size]

# loop through starts and determine length
for s, e in zip(start, ends):
    df.loc[e, 'duration'] = e - s

# move 1 step back so it matches with last value position
df['duration'] = df.duration.shift(-1)

In this example this is pretty fast, but the loop makes it slow on bigger datasets. What would be the fastest way to do something like this?

Asked By: n4321d

||

Answers:

One way I managed to vectorize it is to use .ffill() to create a temporary column and forward-fill the start times in it, then subtract the start time from each end time:

Forward-fill the start times:

df.loc[start, 'temp'] = start
# assign the result back; inplace=True on df.temp may not update df (e.g. under copy-on-write)
df['temp'] = df['temp'].ffill()

Subtract start from end:

df.loc[ends, 'duration'] = ends - df.loc[ends, 'temp']

Shift one step back so the duration lines up with the last '1' of each run:

df['duration'] = df.duration.shift(-1)

The ffill approach is ~1000x faster than the loop on a DataFrame with 1e5 rows.

But I still wonder if this could be further improved…
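
For reference, the three steps of this answer can be sketched without the temporary column. This is only an illustration under assumptions: the six-row frame `demo` and the names `is_start`, `is_end` and `start_time` are made up for the example, and the index is built with `pd.offsets.MonthEnd()` rather than a frequency string.

```python
import pandas as pd

# Hypothetical toy data: the series starts inside a run of '1'.
idx = pd.date_range("2020-01-31", periods=6, freq=pd.offsets.MonthEnd())
demo = pd.DataFrame({"condition": [1, 0, 1, 1, 0, 0]}, index=idx)

step = demo["condition"].diff()
is_start = step > 0   # first '1' of a run
is_end = step < 0     # first '0' after a run

stamps = demo.index.to_series()

# Keep the timestamp on start rows only, then forward-fill it over the run.
start_time = stamps.where(is_start).ffill()

# End minus the most recent start, kept on end rows, then moved up one row
# so it sits on the last '1' of the run.
demo["duration"] = (stamps - start_time).where(is_end).shift(-1)

print(demo)
```

In this sketch a run that is already in progress at the first row has no known start, so its duration stays NaT instead of being paired with a later start.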

Answered By: n4321d

Mark the first '0' in a '0' group and mark the first '1' in a '1' group. Use .loc to select only those rows. Then do a diff on those. Use .loc to keep only the diffs for the first '0' rows. Then assign back to the original dataframe relying on the index to place the durations into the appropriate rows (almost). Then shift into proper place.

s0 = df['condition'].diff() < 0
s1 = df['condition'].diff() > 0

dfc = (
    df.assign(duration=df.loc[s0|s1].index.to_series().diff().loc[s0])
        .assign(duration=lambda x: x['duration'].shift(-1))
)

Result (see timings below this result section)

print(dfc)

            condition duration
2020-01-31          0      NaT
2020-02-29          1      NaT
2020-03-31          1  61 days
2020-04-30          0      NaT
2020-05-31          1      NaT
2020-06-30          1      NaT
2020-07-31          1      NaT
2020-08-31          1      NaT
2020-09-30          1      NaT
2020-10-31          1      NaT
2020-11-30          1 214 days
2020-12-31          0      NaT
2021-01-31          0      NaT
2021-02-28          1  31 days
2021-03-31          0      NaT
2021-04-30          0      NaT
2021-05-31          0      NaT
2021-06-30          0      NaT
2021-07-31          0      NaT
2021-08-31          1  30 days
2021-09-30          0      NaT
2021-10-31          1      NaT
2021-11-30          1  61 days
2021-12-31          0      NaT
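
To make the chained expression easier to follow, here is a sketch that unpacks it into named steps on a small frame. The frame `demo` and the intermediate names (`change_times`, `gaps`, `run_lengths`, `demo_out`) are hypothetical and only used for illustration; the calls are the same ones used in the answer.

```python
import pandas as pd

# Hypothetical toy data with two runs of '1'.
idx = pd.date_range("2020-01-31", periods=6, freq=pd.offsets.MonthEnd())
demo = pd.DataFrame({"condition": [0, 1, 1, 0, 1, 0]}, index=idx)

step = demo["condition"].diff()
s0 = step < 0   # first '0' after a run of '1'
s1 = step > 0   # first '1' of a run

# Step 1: timestamps of every row where the condition changes.
change_times = demo.loc[s0 | s1].index.to_series()

# Step 2: time elapsed since the previous change row.
gaps = change_times.diff()

# Step 3: keep only the gaps that land on a first-'0' row (start -> end).
run_lengths = gaps.loc[s0]

# Step 4: align on the index (other rows become NaT) and move up one row.
demo_out = demo.assign(duration=run_lengths)
demo_out["duration"] = demo_out["duration"].shift(-1)

print(demo_out)
```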

Timing DataFrame

19,135 rows with unique index

Timing – Question – n4321d

%%timeit
ends = df[df.condition.diff() < 0].index
start = df[df.condition.diff() > 0].index[:ends.size]

# loop through starts and determine length
for s, e in zip(start, ends):
    df.loc[e, 'duration'] = e - s

# move 1 step back so it matches with last value position
df['duration'] = df.duration.shift(-1)

500 ms ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing – Answer – n4321d

%%timeit
ends = df[df.condition.diff() < 0].index
start = df[df.condition.diff() > 0].index[:ends.size]

df.loc[start, 'temp'] = start
df.temp.ffill(inplace=True)
df.loc[ends, 'duration'] = ends - df.loc[ends, 'temp']
df['duration'] = df.duration.shift(-1)

8.78 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timing – Answer – jch

%%timeit
s0 = df['condition'].diff() < 0
s1 = df['condition'].diff() > 0

dfc = (
    df.assign(duration=df.loc[s0|s1].index.to_series().diff().loc[s0])
        .assign(duration=lambda x: x['duration'].shift(-1))
)

5.96 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answered By: jch

Inspired by jch: it is even faster to find the starts and ends in one pass. Create a new column holding the timestamps of all change points, then take the difference between consecutive change points and keep it only at the end times:

df.loc[df.condition.diff() != 0, 'temp'] = df.index[df.condition.diff() != 0]
df.loc[df.condition.diff() < 0, 'duration'] = df.temp.dropna().diff()
df['duration'] = df['duration'].shift(-1)
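
The three lines above evaluate `df.condition.diff()` several times. A possible variant, sketched here under assumptions, computes the diff once and skips the `temp` column. The condition values are hard-coded from the result table printed earlier in the thread so the sketch does not depend on the random seed, and the names `changed`, `ended`, `gaps` and `found` are hypothetical.

```python
import pandas as pd

# Condition values copied from the example output in this thread.
values = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
          0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
idx = pd.date_range("2020-01-31", periods=24, freq=pd.offsets.MonthEnd())
df = pd.DataFrame({"condition": values}, index=idx)

step = df["condition"].diff()   # computed once and reused
changed = step.ne(0)            # also True on the first row, since NaN != 0
ended = step.lt(0)              # first '0' after a run of '1'

# Time between consecutive change points, indexed by the change rows only.
gaps = df.index.to_series()[changed].diff()

# Spread back over the full index, keep end rows, move up one row.
df["duration"] = gaps.reindex(df.index).where(ended).shift(-1)

found = df["duration"].dropna()
print(found)
```

Because the first row counts as a change point, a run that starts on the first row is measured from that row. The `reindex` step assumes a unique index.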

Time test with 1000 rows

n4321d method:
5.53 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

jch method:
4.74 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My method:
3.35 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Answered By: ForDaC'de