Pandas start and duration to boolean time series
Question:
I have a dataset containing event occurrences, with a start time and duration. I’m trying to get this into a boolean format (i.e. True value when the event was occurring and a False when it was not).
Example dataframe:
time | event_id | duration |
---|---|---|
2 | 1 | 3 |
3 | 2 | 15 |
7 | 1 | 4 |
14 | 3 | 6 |
and my desired output would be:
time | event_1 | event_2 | event_3 |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 1 | 0 |
4 | 1 | 1 | 0 |
5 | 0 | 1 | 0 |
… | … | … | … |
16 | 0 | 1 | 1 |
17 | 0 | 1 | 1 |
18 | 0 | 0 | 1 |
19 | 0 | 0 | 1 |
20 | 0 | 0 | 0 |
My solution at the moment is this:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"], data=[[2,1,3], [3,2,15], [7,1,4], [14,3,6]])
max_time = 20  # this is an external value but can assume no events happen after this
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time+1, 1)
df = df.reindex(t)
for col in df:
    not_null = df.loc[~df[col].isna(), col]
    for start, length in not_null.items():
        df.loc[start:start+int(length)-1, col] = 1
df = df.fillna(0).astype(int)
This works, but because I’m nesting a loop it doesn’t feel very scalable, and I want to be able to log hundreds of events over spans of several hours. Is there a way I can do this with pandas functions to parallelise this more?
Answers:
One approach could be:
df['rows'] = df[['time', 'duration']].apply(lambda row: list(range(row['time'], row['time'] + row['duration'])), axis=1)
df = df.explode('rows').pivot(index='rows', columns='event_id', values='time').notna().astype('int')
What this does is first create a new column `rows`, which contains a list with the range of integer time steps the event spans; these values are later used as the index. Then, after exploding this column (i.e. creating a new row for each item in these lists), pivot the DataFrame to use `rows` as index, `event_id` as columns, and any other non-empty column as values. Empty values in the DataFrame after the pivot mean there were no such records (and as such, the event was not active at that time), so check for non-null values and convert the result from `bool` to `int`.
The result:
rows 1 2 3
2 1 0 0
3 1 1 0
4 1 1 0
5 0 1 0
6 0 1 0
7 1 1 0
8 1 1 0
9 1 1 0
10 1 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 1 1
15 0 1 1
16 0 1 1
17 0 1 1
18 0 0 1
19 0 0 1
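Note that this result only covers the times where at least one event was active (2–19), and keeps the raw `event_id` values as column names. To match the desired output exactly, the same pipeline can be extended with `reindex` and `add_prefix`; a sketch, assuming `max_time` is available as in the question:

```python
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"],
                  data=[[2, 1, 3], [3, 2, 15], [7, 1, 4], [14, 3, 6]])
max_time = 20

# one list entry per time step during which the event is active
df['rows'] = df.apply(lambda row: list(range(row['time'], row['time'] + row['duration'])), axis=1)

out = (df.explode('rows')
         .pivot(index='rows', columns='event_id', values='time')
         .notna().astype(int)
         .reindex(range(max_time + 1), fill_value=0)  # include times with no events
         .add_prefix('event_')
         .rename_axis(index='time'))
```

The `reindex` adds the all-zero rows for times 0, 1 and 20 that the explode/pivot alone cannot produce, since no event row exists for them.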
Use `Index.repeat` with `DataFrame.loc` to duplicate each row by its duration, add a counter to the start times with `GroupBy.cumcount`, and reshape by `crosstab`. If there are duplicate values in the real data, compare against `0` and cast to integers; last, add the non-existing rows by `DataFrame.reindex` and prefix the columns with `DataFrame.add_prefix`:

Notice: if you need counts of values instead of `1`, remove `.ne(0).astype(int)`.
max_time = 20  # this is an external value but can assume no events happen after this
t = np.arange(max_time+1)
df = df.loc[df.index.repeat(df['duration'])]
df['time'] = df['time'].add(df.groupby(level=0).cumcount())
df = (pd.crosstab(df["time"], df["event_id"]).ne(0).astype(int)
        .reindex(t, fill_value=0)
        .add_prefix('event'))
print(df)
event_id event1 event2 event3
time
0 0 0 0
1 0 0 0
2 1 0 0
3 1 1 0
4 1 1 0
5 0 1 0
6 0 1 0
7 1 1 0
8 1 1 0
9 1 1 0
10 1 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 1 1
15 0 1 1
16 0 1 1
17 0 1 1
18 0 0 1
19 0 0 1
20 0 0 0
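As an aside, the same table can be built without repeating any rows, by broadcasting the start/end times against the time axis in NumPy. This is just a sketch under the same assumptions (variable names are illustrative); it avoids materialising one row per active time step, which may matter when durations are long:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"],
                  data=[[2, 1, 3], [3, 2, 15], [7, 1, 4], [14, 3, 6]])
max_time = 20
t = np.arange(max_time + 1)

start = df['time'].to_numpy()
end = start + df['duration'].to_numpy()
# active[i, j] is True when event row j is running at time i
active = (t[:, None] >= start) & (t[:, None] < end)

out = (pd.DataFrame(active.astype(int), index=pd.Index(t, name='time'),
                    columns=df['event_id'])
         .T.groupby(level=0).max().T  # merge repeated occurrences of an event_id
         .add_prefix('event'))
```

The transpose/groupby/transpose step collapses the two rows for event 1 into a single column.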
EDIT: Vectorized solution: compare a counter of missing values since each event start against the forward-filled durations:
#your solution
max_time = 20 # this is an external value but can assume no events happen after this
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time+1, 1)
df = df.reindex(t)
print (df)
event_id 1 2 3
time
0 NaN NaN NaN
1 NaN NaN NaN
2 3.0 NaN NaN
3 NaN 15.0 NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 4.0 NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN 6.0
15 NaN NaN NaN
16 NaN NaN NaN
17 NaN NaN NaN
18 NaN NaN NaN
19 NaN NaN NaN
20 NaN NaN NaN
#https://stackoverflow.com/a/48659066/2901002
a = df.isna()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill()).lt(df.ffill()).astype(int).add_prefix('event')
print (df1)
event_id event1 event2 event3
time
0 0 0 0
1 0 0 0
2 1 0 0
3 1 1 0
4 1 1 0
5 0 1 0
6 0 1 0
7 1 1 0
8 1 1 0
9 1 1 0
10 1 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 1 1
15 0 1 1
16 0 1 1
17 0 1 1
18 0 0 1
19 0 0 1
20 0 0 0
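To see why the `b.sub(b.mask(a).ffill())` expression works, it helps to run it on a single column. `b` is a running count of NaNs; masking it at the non-NaN positions and forward filling freezes that count at the last event start, so the difference is the number of steps elapsed since that start, which is then compared against the forward-filled duration. A minimal hand-made example:

```python
import numpy as np
import pandas as pd

# one event starting at index 2 with duration 3
s = pd.Series([np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, np.nan])

a = s.isna()                      # True where no event starts
b = a.cumsum()                    # running NaN count: 1 2 2 3 4 5 6 7
steps = b.sub(b.mask(a).ffill())  # steps since last start: NaN NaN 0 1 2 3 4 5
active = steps.lt(s.ffill()).astype(int)  # steps < duration
```

The leading NaNs compare as False in `lt`, so times before the first event correctly come out as 0.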