insert record to fill missing time window

Question:

I have a dataset of consecutive time periods, each corresponding to an activity (drive, rest, charge, etc.). There is no record for the night, so the data is not continuous. I would like to add an extra record to fill each gap, so that the start time of each record always equals the end time of the previous record. What is the best way to insert these records automatically (for different vehicle IDs)? My data looks like this now:

import pandas as pd
from io import StringIO

csv = """
id,starttime,endtime
1,2022-09-19 17:05:00,2022-09-19 17:26:00
1,2022-09-19 17:26:00,2022-09-19 18:38:00
1,2022-09-19 18:38:00,2022-09-19 19:31:00
1,2022-09-19 19:31:00,2022-09-19 19:38:00
1,2022-09-19 19:38:00,2022-09-19 19:40:00
1,2022-09-19 19:40:00,2022-09-19 19:41:00
1,2022-09-20 07:06:00,2022-09-20 07:06:00
1,2022-09-20 07:06:00,2022-09-20 07:23:00
1,2022-09-20 07:23:00,2022-09-20 07:26:00
1,2022-09-20 07:26:00,2022-09-20 07:37:00
"""

df = pd.read_csv(StringIO(csv))

And I would like to add the extra record:

1,2022-09-19 19:41:00,2022-09-20 07:06:00

(In the real case, this applies to multiple days and multiple IDs.)

Asked By: pieterbons


Answers:

Annotated code

# Shift endtime down by one row within each id,
# so that lag holds the previous row's endtime
df['lag'] = df.groupby('id')['endtime'].shift()

# boolean condition to identify rows where the starttime
# of the current row is not equal to the endtime of the previous row
mask = (df['starttime'] != df['lag']) & df['lag'].notna()

# select the rows where the condition is True, then use the old starttime
# as the new endtime and lag as the new starttime
rows = df[mask].drop(columns=['endtime'])
rows = rows.rename(columns={'starttime': 'endtime', 'lag': 'starttime'})

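# Give each new row the index of the row it follows, so that sorting
# the index in the next step places it directly after that row
rows.index -= 1

# append the new rows and sort the index; a stable sort keeps each
# original row ahead of the inserted row that shares its index
result = pd.concat([df, rows]).sort_index(kind='mergesort', ignore_index=True).drop(columns='lag')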

Result

    id            starttime              endtime
0    1  2022-09-19 17:05:00  2022-09-19 17:26:00
1    1  2022-09-19 17:26:00  2022-09-19 18:38:00
2    1  2022-09-19 18:38:00  2022-09-19 19:31:00
3    1  2022-09-19 19:31:00  2022-09-19 19:38:00
4    1  2022-09-19 19:38:00  2022-09-19 19:40:00
5    1  2022-09-19 19:40:00  2022-09-19 19:41:00
6    1  2022-09-19 19:41:00  2022-09-20 07:06:00 # -- inserted row --
7    1  2022-09-20 07:06:00  2022-09-20 07:06:00
8    1  2022-09-20 07:06:00  2022-09-20 07:23:00
9    1  2022-09-20 07:23:00  2022-09-20 07:26:00
10   1  2022-09-20 07:26:00  2022-09-20 07:37:00
Answered By: Shubham Sharma
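
As a follow-up to the shift-based approach above, here is a minimal sketch that packs the same idea into a function for several ids. It is not the answerer's code: the name `fill_gaps` is hypothetical, it assumes `starttime` and `endtime` are already parsed as datetimes, and it only fills true gaps (`starttime` later than the previous `endtime`) rather than every mismatch. It orders rows by value instead of by index arithmetic, so it does not depend on a default integer index.

```python
import pandas as pd


def fill_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Insert one row per gap so consecutive records touch within each id.

    Hypothetical helper; assumes datetime columns 'starttime' and 'endtime'.
    """
    order = ["id", "starttime", "endtime"]
    df = df.sort_values(order).reset_index(drop=True)

    # endtime of the previous record of the same id (NaT for the first one)
    prev_end = df.groupby("id")["endtime"].shift()

    # a gap exists where the current record starts after the previous one ended
    is_gap = prev_end.notna() & (df["starttime"] > prev_end)

    # each gap row runs from the previous endtime to the current starttime
    gaps = pd.DataFrame({
        "id": df.loc[is_gap, "id"],
        "starttime": prev_end[is_gap],
        "endtime": df.loc[is_gap, "starttime"],
    })

    return pd.concat([df, gaps]).sort_values(order).reset_index(drop=True)


# small made-up example with two ids, each with one overnight gap
data = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "starttime": pd.to_datetime([
        "2022-09-19 19:40:00", "2022-09-20 07:06:00", "2022-09-20 07:06:00",
        "2022-09-19 18:00:00", "2022-09-20 08:00:00",
    ]),
    "endtime": pd.to_datetime([
        "2022-09-19 19:41:00", "2022-09-20 07:06:00", "2022-09-20 07:23:00",
        "2022-09-19 18:30:00", "2022-09-20 08:15:00",
    ]),
})
filled = fill_gaps(data)
print(filled)
```

Sorting on `endtime` as a third key keeps a zero-length record (such as 07:06 to 07:06) ahead of the longer record that starts at the same time.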

First, find the maximum endtime per day and id (this becomes the starttime of the row to add) and the minimum starttime per day and id (this becomes its endtime). Then, for each maximum endtime, merge in the minimum starttime of the next day that has records.

import pandas as pd
from io import StringIO

csv = """
id,starttime,endtime
1,2022-09-19 17:05:00,2022-09-19 17:26:00
1,2022-09-19 17:26:00,2022-09-19 18:38:00
1,2022-09-19 18:38:00,2022-09-19 19:31:00
1,2022-09-19 19:31:00,2022-09-19 19:38:00
1,2022-09-19 19:38:00,2022-09-19 19:40:00
1,2022-09-19 19:40:00,2022-09-19 19:41:00
1,2022-09-20 07:06:00,2022-09-20 07:06:00
1,2022-09-20 07:06:00,2022-09-20 07:23:00
1,2022-09-20 07:23:00,2022-09-20 07:26:00
1,2022-09-20 07:26:00,2022-09-20 07:37:00
"""

df = pd.read_csv(StringIO(csv), parse_dates=['starttime','endtime'])

# find last endtime per day
# result will already be sorted as per requirement of pd.merge_asof
endtime = df.groupby([pd.Grouper(key='endtime', freq='D'),'id'])['endtime'].max()
endtime.index.rename(['day','id'], inplace=True)

# find first starttime per day
starttime = df.groupby([pd.Grouper(key='starttime', freq='D'), 'id'])['starttime'].min()
starttime.index.rename(['day', 'id'], inplace=True)

# conditional on the id, merge the first starttime that occurs after last endtime
# label `last endtime` as `starttime` and `first starttime` as `endtime`
overnight = pd.merge_asof(
    endtime.reset_index().rename(columns={'endtime':'starttime'}),
    starttime.reset_index().rename(columns={'starttime':'endtime'}),
    on='day',
    by='id',
    direction='forward',
    allow_exact_matches=False
)

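# add the overnight rows, sort per id, and drop the unmatched rows
# (the last day has no later day, so its endtime is NaT)
result = (
    pd.concat([df, overnight[['id', 'starttime', 'endtime']]])
    .sort_values(by=['id', 'starttime'])
    .dropna(subset=['starttime', 'endtime'])
    .reset_index(drop=True)
)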

This will produce

>>> result
    id           starttime             endtime
0    1 2022-09-19 17:05:00 2022-09-19 17:26:00
1    1 2022-09-19 17:26:00 2022-09-19 18:38:00
2    1 2022-09-19 18:38:00 2022-09-19 19:31:00
3    1 2022-09-19 19:31:00 2022-09-19 19:38:00
4    1 2022-09-19 19:38:00 2022-09-19 19:40:00
5    1 2022-09-19 19:40:00 2022-09-19 19:41:00
6    1 2022-09-19 19:41:00 2022-09-20 07:06:00 <----- added row
7    1 2022-09-20 07:06:00 2022-09-20 07:06:00
8    1 2022-09-20 07:06:00 2022-09-20 07:23:00
9    1 2022-09-20 07:23:00 2022-09-20 07:26:00
10   1 2022-09-20 07:26:00 2022-09-20 07:37:00
Answered By: Harry Haller
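
Whichever approach is used, it can help to confirm that the result really is continuous. The sketch below is one way to check this; the name `is_continuous` is hypothetical, and it assumes the rows are already in chronological order within each id.

```python
import pandas as pd


def is_continuous(df: pd.DataFrame) -> bool:
    """Return True if every starttime equals the previous endtime of the same id.

    Hypothetical helper; assumes rows are in chronological order within each id.
    """
    prev_end = df.groupby("id")["endtime"].shift()
    # the first record of each id has no predecessor, so it always passes
    return bool(((df["starttime"] == prev_end) | prev_end.isna()).all())


# made-up two-row example with an overnight gap
with_gap = pd.DataFrame({
    "id": [1, 1],
    "starttime": pd.to_datetime(["2022-09-19 19:40:00", "2022-09-20 07:06:00"]),
    "endtime": pd.to_datetime(["2022-09-19 19:41:00", "2022-09-20 07:23:00"]),
})

# the same data with the overnight record added
overnight_row = pd.DataFrame({
    "id": [1],
    "starttime": pd.to_datetime(["2022-09-19 19:41:00"]),
    "endtime": pd.to_datetime(["2022-09-20 07:06:00"]),
})
without_gap = (
    pd.concat([with_gap, overnight_row])
    .sort_values(["id", "starttime"])
    .reset_index(drop=True)
)

print(is_continuous(with_gap))
print(is_continuous(without_gap))
```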