Pandas start and duration to boolean time series

Question:

I have a dataset containing event occurrences, with a start time and duration. I’m trying to get this into a boolean format (i.e. True value when the event was occurring and a False when it was not).

Example dataframe:

time event_id duration
2 1 3
3 2 15
7 1 4
14 3 6

and my desired output would be:

time event_1 event_2 event3
0 0 0 0
1 0 0 0
2 1 0 0
3 1 1 0
4 1 1 0
5 0 1 0
16 0 1 1
17 0 1 1
18 0 0 1
19 0 0 1
20 0 0 0

My solution at the moment is this:

df = pd.DataFrame(columns=["time", "event_id", "duration"], data=[[2,1,3], [3,2,15], [7,1,4], [14,3,6]])
max_time = 20  # this is an external value but can assume no events happen after this
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time+1, 1)
df = df.reindex(t)

for col in df:
    not_null = df.loc[~df[col].isna(), col]
    for start, length in not_null.items():
        df.loc[start:start+int(length)-1, col] = 1

df = df.fillna(0).astype(int)

This works but because I’m nesting a loop it doesn’t feel very scalable, and I want to be able to log hundreds of events over spans of several hours. Is there way I can do this with pandas functions to parallelise this more?

Asked By: Apollo42

||

Answers:

One approach could be:

df['rows'] = df[['time','duration']].apply(lambda row: list(range(row['time'], row['time']+row['duration'])), axis = 1)
df = df.explode('rows').pivot(index = 'rows', columns = 'event_id', values = 'time').notna().astype('int')

What this does is first create a new col rows which contains a list with a range of values that indicate int time period, these values are later used as index.

Then after exploding this column (i.e. creating a new row for each item in these lists), pivot the DataFrame to use 'rows' as index, 'event_id' as columns, and any other non-empty col as values. Empty values in the DataFrame after pivot mean there were no such records (and as such, the event was not present at that time period), so check for null values and convert result from bool to int.

The result:

rows    1   2   3
2   1   0   0
3   1   1   0
4   1   1   0
5   0   1   0
6   0   1   0
7   1   1   0
8   1   1   0
9   1   1   0
10  1   1   0
11  0   1   0
12  0   1   0
13  0   1   0
14  0   1   1
15  0   1   1
16  0   1   1
17  0   1   1
18  0   0   1
19  0   0   1
Answered By: dm2

Use Index.repeat with DataFrame.loc for duplicated values, add counter by GroupBy.cumcount, reshape by crosstab – if some puplicates values in real data compare by 0 and cast to integers, last add non exist rows by DataFrame.reindex with DataFrame.add_prefix:

Notice: If need count values instead set 1 remove .ne(0).astype(int)

max_time = 20  # this is an external value but can assume no events happen after this
t = np.arange(max_time+1)

df = df.loc[df.index.repeat(df['duration'])]
df['time'] = df['time'].add(df.groupby(level=0).cumcount())

df = (pd.crosstab(df["time"],df["event_id"]).ne(0).astype(int)
        .reindex(t, fill_value=0)
        .add_prefix('event'))
print (df)
event_id  event1  event2  event3
time                            
0              0       0       0
1              0       0       0
2              1       0       0
3              1       1       0
4              1       1       0
5              0       1       0
6              0       1       0
7              1       1       0
8              1       1       0
9              1       1       0
10             1       1       0
11             0       1       0
12             0       1       0
13             0       1       0
14             0       1       1
15             0       1       1
16             0       1       1
17             0       1       1
18             0       0       1
19             0       0       1
20             0       0       0

EDIT: Vectorized solution: Compare forward filled missing values by counter of all missing values:

#your solution
max_time = 20  # this is an external value but can assume no events happen after this
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time+1, 1)
df = df.reindex(t)
print (df)
event_id    1     2    3
time                    
0         NaN   NaN  NaN
1         NaN   NaN  NaN
2         3.0   NaN  NaN
3         NaN  15.0  NaN
4         NaN   NaN  NaN
5         NaN   NaN  NaN
6         NaN   NaN  NaN
7         4.0   NaN  NaN
8         NaN   NaN  NaN
9         NaN   NaN  NaN
10        NaN   NaN  NaN
11        NaN   NaN  NaN
12        NaN   NaN  NaN
13        NaN   NaN  NaN
14        NaN   NaN  6.0
15        NaN   NaN  NaN
16        NaN   NaN  NaN
17        NaN   NaN  NaN
18        NaN   NaN  NaN
19        NaN   NaN  NaN
20        NaN   NaN  NaN

#https://stackoverflow.com/a/48659066/2901002
a = df.isna()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill()).lt(df.ffill()).astype(int).add_prefix('event')
print (df1)
event_id  event1  event2  event3
time                            
0              0       0       0
1              0       0       0
2              1       0       0
3              1       1       0
4              1       1       0
5              0       1       0
6              0       1       0
7              1       1       0
8              1       1       0
9              1       1       0
10             1       1       0
11             0       1       0
12             0       1       0
13             0       1       0
14             0       1       1
15             0       1       1
16             0       1       1
17             0       1       1
18             0       0       1
19             0       0       1
20             0       0       0
Answered By: jezrael
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.