How to merge data in DataFrame in overlapping time periods in pandas?

Question:

I have a pandas DataFrame like the following:

           start_time status               duration
0 2023-03-16 01:30:00     OK        0 days 00:02:00          
1 2023-03-16 01:31:00   WARN        0 days 00:19:28  
2 2023-03-16 01:31:00  ERROR        0 days 00:09:28  
3 2023-03-16 01:32:00  ERROR        0 days 00:03:00
4 2023-03-16 01:33:00     OK        0 days 00:03:00
5 2023-03-16 01:35:00     OK        0 days 12:05:28 

It lists statuses and how long each one lasts; the time periods can overlap.

I need to merge the time periods, keeping the worst status wherever they overlap, to form a new timeline, which should look like:

           start_time status        
0 2023-03-16 01:30:00     OK                
1 2023-03-16 01:31:00  ERROR
2 2023-03-16 01:40:28   WARN
3 2023-03-16 01:50:28     OK

It starts with the OK status. The next period, WARN, overlaps with ERROR, so we keep ERROR; when the ERROR period finishes, the status changes back to WARN, and then back to OK. Rows 3 and 4 are skipped because they overlap with the ERROR in row 2.

My idea is, for each row, to calculate an end_time as start_time + duration, use start_time and end_time to create a time series with a 10 s timedelta, explode the time series into a data frame and forward-fill the status, then align or merge_asof the data frames pairwise and write a function that aggregates the status by worst case. That sounds very inefficient and slow, and the data has 100k rows.

Asked By: hejy

||

Answers:

First, I calculate the end times. Then I collect all the start and end times (except the last one, since that is the final end time), map each time to the most important status active at that moment, and store the result as events. Finally, I remove consecutive duplicate rows by comparing the status column with itself shifted by one.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "start_time": [
            "2023-03-16 01:30:00",
            "2023-03-16 01:31:00",
            "2023-03-16 01:31:00",
            "2023-03-16 01:32:00",
            "2023-03-16 01:33:00",
            "2023-03-16 01:35:00",
        ],
        "status": ["OK", "WARN", "ERROR", "ERROR", "OK", "OK"],
        "duration": [
            "00:02:00",
            "00:19:28",
            "00:09:28",
            "00:03:00",
            "00:03:00",
            "12:05:28",
        ],
    }
)
# astype("datetime64") without a unit is rejected by recent pandas versions
df["start_time"] = pd.to_datetime(df["start_time"])
df["duration"] = pd.to_timedelta(df["duration"])

df["end_time"] = df["start_time"] + df["duration"]

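# Lower value means worse status, so min() picks the worst one.
priority = {"ERROR": 0, "WARN": 1, "OK": 2}

# Every start or end time is a point where the status can change.
times = np.union1d(df["start_time"], df["end_time"])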

events = pd.DataFrame.from_dict(
    {
        time: min(
            df.loc[(df.start_time <= time) & (df.end_time > time), "status"].values,
            key=lambda x: priority[x],
        )
        for time in times[:-1]
    },
    "index",
)

print(events)

# Keep only the rows whose status differs from the previous row.
events = events[events[0] != events[0].shift()]

print(events)

Output:

                    0
2023-03-16 01:30:00 OK
2023-03-16 01:31:00 ERROR
2023-03-16 01:40:28 WARN
2023-03-16 01:50:28 OK
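
For the 100k rows mentioned in the question, the comprehension above filters the whole frame once per boundary time, which may get slow. One possible alternative, sketched here under assumptions rather than as a definitive implementation, is a sweep over start/end events: count how many intervals of each status are open at every boundary and take the worst status with a non-zero count. The function name merge_worst_status and its priority parameter are hypothetical names introduced for illustration. The sketch also assumes that stretches with no open interval should be omitted from the result, and that every status in the data appears in priority (others would be ignored).

```python
import numpy as np
import pandas as pd


def merge_worst_status(df, priority=("ERROR", "WARN", "OK")):
    """Hypothetical helper: `priority` lists statuses from worst to best."""
    start = pd.to_datetime(df["start_time"])
    end = start + pd.to_timedelta(df["duration"])

    # One +1 event when an interval opens and one -1 event when it closes.
    events = pd.concat(
        [
            pd.DataFrame({"time": start, "status": df["status"], "delta": 1}),
            pd.DataFrame({"time": end, "status": df["status"], "delta": -1}),
        ],
        ignore_index=True,
    )

    # Net change per (time, status), then a running count of open intervals
    # for each status, with columns ordered from worst to best.
    active = (
        events.groupby(["time", "status"])["delta"]
        .sum()
        .unstack(fill_value=0)
        .reindex(columns=list(priority), fill_value=0)
        .cumsum()
    )

    # argmax returns the first True per row, i.e. the worst open status.
    is_open = active.to_numpy() > 0
    labels = np.array(priority, dtype=object)[is_open.argmax(axis=1)]
    labels[~is_open.any(axis=1)] = None  # assumption: gaps carry no status

    worst = pd.Series(labels, index=active.index, name="status")
    # Keep only the rows where the status changes, then drop the gaps.
    worst = worst[worst.ne(worst.shift())].dropna()
    return worst.rename_axis("start_time").reset_index()


# Sample data from the question.
df = pd.DataFrame(
    {
        "start_time": [
            "2023-03-16 01:30:00", "2023-03-16 01:31:00", "2023-03-16 01:31:00",
            "2023-03-16 01:32:00", "2023-03-16 01:33:00", "2023-03-16 01:35:00",
        ],
        "status": ["OK", "WARN", "ERROR", "ERROR", "OK", "OK"],
        "duration": [
            "00:02:00", "00:19:28", "00:09:28",
            "00:03:00", "00:03:00", "12:05:28",
        ],
    }
)

result = merge_worst_status(df)
print(result)
```

On the sample data this should give the same four rows as the output above (OK, ERROR, WARN, OK). The work is one groupby plus a cumulative sum rather than one filter per boundary, so it should scale better, though I have not benchmarked it on 100k rows.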
Answered By: Michael Cao

import pandas as pd
import datetime

df = pd.DataFrame({'start_time': ['2023-03-16 01:30:00', '2023-03-16 01:31:00', '2023-03-16 01:31:00', 
                                  '2023-03-16 01:32:00', '2023-03-16 01:33:00', '2023-03-16 01:35:00'], 
                   'status': ["OK", "WARN", "ERROR", "ERROR", "OK", "OK"], 
                   'duration': ["0 days 00:02:00", "0 days 00:19:28", "0 days 00:09:28", 
                                "0 days 00:03:00", "0 days 00:03:00", "0 days 12:05:28"], 
                   })

df['diff'] = ( pd.to_timedelta(df['duration']) + pd.to_datetime(df['start_time']) ).diff()

def reduce(g):
    d = g.iloc[:-1, :] if len(g) > 1 else g  # drop the last row of multi-row groups
    mask = d['diff'].gt(datetime.timedelta(0)) 
    
    if len(g)>1:
        return pd.concat([d.loc[d['diff'][mask].index], d.tail(1)], axis=0)
    else:
        return d
        
r = (df.groupby('status')
       .apply(lambda g: reduce(g))
       .reset_index(allow_duplicates=True)
       .T
       .drop_duplicates(keep='first')
       .T
       .drop('diff',axis=1)
       )

r['start_time'] = ( pd.to_timedelta(r['duration']) + pd.to_datetime(r['start_time']) )

print(r[['start_time', 'status']])

Result

           start_time status
0 2023-03-16 01:40:28  ERROR
1 2023-03-16 01:36:00     OK
2 2023-03-16 01:36:00     OK
3 2023-03-16 01:50:28   WARN
Answered By: Laurent B.