Breakdown data-frame into second-by-second time-series

Question:

I have this dataset of active subjects during specified time-periods.

    start    end name
0   00:00  00:10    a
1   00:10  00:20    b
2   00:00  00:20    c
3   00:00  00:10    d
4   00:10  00:15    e
5   00:15  00:20    a

The intervals are inclusive on the left(start) side and not inclusive on the right(end).

There are always three subjects active. I want to increase the granularity of the data, so that I will have info of the three active subjects for each second. Each second has three unique values.

This would be the desired result for the test case.

   slot1 slot2 slot3
0      a     c     d
1      a     c     d
2      a     c     d
3      a     c     d
4      a     c     d
5      a     c     d
6      a     c     d
7      a     c     d
8      a     c     d
9      a     c     d
10     b     c     e
11     b     c     e
12     b     c     e
13     b     c     e
14     b     c     e
15     b     c     a
16     b     c     a
17     b     c     a
18     b     c     a
19     b     c     a

The order of the subjects inside the slots is irrelevant for now. The subjects can reappear in the data like "a" from 00:00 to 00:10 and then again from 00:15 to 00:20. The intervals can be at any second.

Asked By: Borut Flis

||

Answers:

Route 1: One (costly but easy) way is to explode the data to the seconds, then merge 3 times:

time_df = (('00:' + df[['start','end']])
               .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
               .astype(int)
               .apply(lambda x: np.arange(*x), axis=1)
               .to_frame('time')
               .assign(slot=df['name'])
               .explode('time')
          )

(time_df.merge(time_df, on='time', suffixes=['1','2'])
     .query('slot1 < slot2')
     .merge(time_df, on='time')
     .query('slot2 < slot')
)

Output:

    time slot1 slot2 slot
2      0     a     c    d
11     1     a     c    d
20     2     a     c    d
29     3     a     c    d
38     4     a     c    d
47     5     a     c    d
56     6     a     c    d
65     7     a     c    d
74     8     a     c    d
83     9     a     c    d
92    10     b     c    e
101   11     b     c    e
110   12     b     c    e
119   13     b     c    e
128   14     b     c    e
139   15     a     b    c
148   16     a     b    c
157   17     a     b    c
166   18     a     b    c
175   19     a     b    c

Route 2: Another way is to cross merge then query the overlapping intervals:

df[['start','end']] = (('00:' + df[['start','end']])
               .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
               .astype(int)
          )

(df.merge(df, how='cross')
   .assign(start=lambda x: x.filter(like='start').max(axis=1),
           end=lambda x: x.filter(like='end').min(axis=1))
   .query('start < end & name_x < name_y')
   [['name_x','name_y','start','end']]
   .merge(df, how='cross')
   .assign(start=lambda x: x.filter(like='start').max(axis=1),
           end=lambda x: x.filter(like='end').min(axis=1))
   .query('start < end & name_y < name')
   [['start','end', 'name_x','name_y', 'name']]
)

Output:

    start  end name_x name_y name
3       0   10      a      c    d
16     10   15      b      c    e
38     15   20      a      b    c

As you can see the this output is just the same as the other, but in the original form. Depending on your data, one route might better than the other.


Update Since your data has exactly 3 slot at any time, you can easily do with pivot. This is the best solution.

# time_df as in Route 1
(time_df.sort_values(['time','slot'])
    .assign(nums = lambda x: np.arange(len(x)) % 3)
    .pivot('time', 'nums', 'slot')
)
# in general, `.assign(nums=lambda x: x.groupby('time').cumcount()`
# also works instead of the above

Output:

nums  0  1  2
time         
0     a  c  d
1     a  c  d
2     a  c  d
3     a  c  d
4     a  c  d
5     a  c  d
6     a  c  d
7     a  c  d
8     a  c  d
9     a  c  d
10    b  c  e
11    b  c  e
12    b  c  e
13    b  c  e
14    b  c  e
15    a  b  c
16    a  b  c
17    a  b  c
18    a  b  c
19    a  b  c
Answered By: Quang Hoang

This solution uses piso (pandas interval set operations) and will run fast.

setup

Create data and convert to pandas.Timedelta

df = pd.DataFrame(
    {
        "start": ["00:00", "00:10", "00:00", "00:00", "00:10", "00:15"],
        "end": ["00:10", "00:20", "00:20", "00:10", "00:15", "00:20"],
        "name": ["a", "b", "c", "d", "e", "a"],
    }
)
df[["start", "end"]] = ("00:" + df[["start", "end"]].astype(str)).apply(pd.to_timedelta)

create the sample times (a pandas.TimedeltaIndex of seconds):

sample_times = pd.timedelta_range(df["start"].min(), df["end"].max(), freq="s")

solution

For each possible value of "name" create a pandas.IntervalIndex which has the intervals defined by start and stop columns:

ii_series = df.groupby("name").apply(
    lambda d: pd.IntervalIndex.from_arrays(d["start"], d["end"], closed="left")
)

ii_series looks like this:

name
a    IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
b    IntervalIndex([[0 days 00:10:00, 0 days 00:20:...
c    IntervalIndex([[0 days 00:00:00, 0 days 00:20:...
d    IntervalIndex([[0 days 00:00:00, 0 days 00:10:...
e    IntervalIndex([[0 days 00:10:00, 0 days 00:15:...
dtype: object

Then to each of these interval index we’ll apply the piso.contains function, which can be used to test whether a set of points is contained in an interval

contained = ii_series.apply(piso.contains,x=sample_times, result="points")

contained will be a dataframe indexed by the names, and whose columns are the sample times. The transpose of this, looks like:

                     a      b      c      d      e
0 days 00:00:00   True  False   True   True  False
0 days 00:00:01   True  False   True   True  False
0 days 00:00:02   True  False   True   True  False
0 days 00:00:03   True  False   True   True  False
0 days 00:00:04   True  False   True   True  False
...                ...    ...    ...    ...    ...
0 days 00:19:56   True   True   True  False  False
0 days 00:19:57   True   True   True  False  False
0 days 00:19:58   True   True   True  False  False
0 days 00:19:59   True   True   True  False  False
0 days 00:20:00  False  False  False  False  False

This format of data may be easier to work with, depending on the application, but if you want to have it in the format stated in the question then you can create a series of lists, indexed by each second:

series_of_lists = (
    contained.transpose()
    .melt(ignore_index=False)
    .query("value == True")
    .reset_index()
    .groupby("index")["name"]
    .apply(pd.Series.to_list)
)

Then convert to dataframe:

pd.DataFrame(series_of_lists.to_list(), index=series_of_lists.index)

which will look like this:

                 0  1  2
          index                   
0 days 00:00:00  a  c  d
0 days 00:00:01  a  c  d
0 days 00:00:02  a  c  d
0 days 00:00:03  a  c  d
0 days 00:00:04  a  c  d
...             .. .. ..
0 days 00:19:55  a  b  c
0 days 00:19:56  a  b  c
0 days 00:19:57  a  b  c
0 days 00:19:58  a  b  c
0 days 00:19:59  a  b  c

Note: I am the creator of piso, feel free to reach out if you have any questions.

Answered By: Riley