Faster way to insert a row into a time series in Python

Question:

I have a dataframe in which one of the columns is ‘date’.

It contains datetime values in the format 2020-11-04 09:15:00+05:30, covering 45 days.

The data for each day starts at 09:15:00 and ends at 18:30:00.

Apart from the date, there is an x column and a y column.

For each day, I want to insert a new row at 09:14:00, just before the 09:15:00 row.

The x of the new row should be the y of the previous row, i.e. the 18:30:00 row of the previous day.

The y of the new row should be the x of the next row, i.e. the 09:15:00 row of the same day.

I tried the code below, but the result is wrong and it is also very slow.

def add_one_row(df):
    
    new_df = pd.DataFrame(columns=df.columns)

    for i, row in df.iterrows():
        
        if i != 0 and row['date'].time() == pd.to_datetime('09:15:00').time():
            
            new_row = row.copy()
            new_row['date'] = new_row['date'].replace(minute=14, second=0)
            new_row['x'] = df.loc[i-1, 'y']
            new_row['y'] = row['x']
            new_df = pd.concat([new_df, new_row])
            
        new_df = pd.concat([new_df, row])
        
    return new_df

I expected row and new_row to be appended to new_df as rows. Instead, the result has a column named 0.


     date | x | y | 0
date              2020-11-04 09:15:00+05:30
x                 50
y                 60
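
For context, a likely cause of the stray 0 column: each row yielded by iterrows() is a Series, and pd.concat along the default axis treats a Series as a single column rather than as a row. The sketch below uses made-up sample data to reproduce this and shows one way to get a row instead, by converting the Series with to_frame().T. It only addresses the shape problem; concatenating inside a loop stays slow.

```python
import pandas as pd

# Made-up sample data, only for illustration
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-04 09:15:00+05:30",
                            "2020-11-04 18:30:00+05:30"]),
    "x": [50, 51],
    "y": [60, 61],
})

# A Series labelled 'date', 'x', 'y', like a row produced by iterrows()
row = df.iloc[0]

# Concatenating the Series directly adds it as a column named 0
bad = pd.concat([df, row])
print(bad.columns.tolist())

# Converting the Series to a one-row DataFrame first keeps the original columns
good = pd.concat([df, row.to_frame().T], ignore_index=True)
print(good.columns.tolist())
```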

It is also really slow, so I need a faster approach, perhaps using vectorization.

Can someone provide a faster and correct way to solve this?

Asked By: Ash

||

Answers:

If you set the date as the index, you can use .at_time() to easily access the 09:15:00 rows.

From there you can look back for the previous 18:30:00 rows and then take their y values.

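import pandas as pd

# assumes df["date"] already holds datetime values
new_df = df.set_index("date")

# 09:15 rows, skipping the first one; copy so the rows can be modified safely
start_rows = new_df.at_time("09:15:00").tail(-1).copy()

# `18:30` rows of the previous calendar day (09:15 minus 14h45m);
# .loc raises a KeyError if any of those timestamps is missing
end_rows = new_df.loc[
    start_rows.index - pd.DateOffset(hours=14, minutes=45)
]

# change to `09:14:00`
start_rows.index -= pd.DateOffset(minutes=1)
start_rows["y"] = start_rows["x"]

# x of each new row = y of the matching 18:30 row
start_rows.update(end_rows["y"].rename("x").set_axis(start_rows.index))

new_df = pd.concat([new_df, start_rows]).sort_index().reset_index()

Answered By: jqurious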

IIUC, you can use the following vectorized code:

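from datetime import time

import pandas as pd

# Convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'])

# Shift every row back one minute, take x from the previous row's y and y from
# the row's own x, then keep only the resulting 09:14 rows.
# Note: fill_value=0 gives the first 09:14 row x=0, since it has no previous row.
df914 = (df.assign(x=df['y'].shift(fill_value=0), y=df['x'],
                   date=df['date'] - pd.DateOffset(minutes=1))
           .loc[lambda x: x['date'].dt.time == time(9, 14)])

# Concatenate both dataframes and sort by index; the stable sort keeps each
# 09:14 row ahead of the 09:15 row that shares its index label
out = pd.concat([df914, df], axis=0).sort_index(kind='stable', ignore_index=True)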

Output:

>>> out
                       date   x   y
0 2020-11-04 09:14:00+05:30   0  10  # Added
1 2020-11-04 09:15:00+05:30  10  20
2 2020-11-04 12:00:00+05:30  11  21
3 2020-11-04 18:30:00+05:30  12  22
4 2020-11-05 09:14:00+05:30  22  13  # Added
5 2020-11-05 09:15:00+05:30  13  23
6 2020-11-05 12:00:00+05:30  14  24
7 2020-11-05 18:30:00+05:30  15  25

Minimal Reproducible Example

data = {'date': ['2020-11-04 09:15:00+05:30', '2020-11-04 12:00:00+05:30',
                 '2020-11-04 18:30:00+05:30', '2020-11-05 09:15:00+05:30',
                 '2020-11-05 12:00:00+05:30', '2020-11-05 18:30:00+05:30'],
        'x': [10, 11, 12, 13, 14, 15], 'y': [20, 21, 22, 23, 24, 25]}
df = pd.DataFrame(data)

Answered By: Corralien
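
As a complement to the answers above, here is a minimal sketch that follows the same general idea (build all the 09:14 rows in bulk, then concatenate and sort stably) but, matching the i != 0 condition in the question, adds no row before the very first row. The function name insert_pre_open_rows and its open_time parameter are hypothetical, and the sketch assumes the rows are already sorted by date.

```python
from datetime import time

import numpy as np
import pandas as pd


def insert_pre_open_rows(df, open_time=time(9, 15)):
    """Hypothetical helper: add a row one minute before each opening row,
    except when the opening row is the first row of the frame."""
    df = df.reset_index(drop=True)  # new frame with labels 0..n-1
    df["date"] = pd.to_datetime(df["date"])

    # Positions of opening rows that have a preceding row
    pos = np.flatnonzero((df["date"].dt.time == open_time).to_numpy())
    pos = pos[pos > 0]

    new_rows = df.iloc[pos].copy()
    new_rows["date"] = new_rows["date"] - pd.Timedelta(minutes=1)
    new_rows["x"] = df["y"].to_numpy()[pos - 1]  # y of the previous row
    new_rows["y"] = df["x"].to_numpy()[pos]      # x of the opening row

    # Each new row shares its index label with its opening row; listing the
    # new rows first and sorting stably puts each one directly before it.
    return pd.concat([new_rows, df]).sort_index(kind="stable", ignore_index=True)


# Small sample in the layout described in the question
data = {
    "date": ["2020-11-04 09:15:00+05:30", "2020-11-04 12:00:00+05:30",
             "2020-11-04 18:30:00+05:30", "2020-11-05 09:15:00+05:30",
             "2020-11-05 12:00:00+05:30", "2020-11-05 18:30:00+05:30"],
    "x": [10, 11, 12, 13, 14, 15],
    "y": [20, 21, 22, 23, 24, 25],
}
out = insert_pre_open_rows(pd.DataFrame(data))
print(out)
```

Looking up the previous row by position rather than by a fixed time offset means a gap between trading days (a weekend, for example) does not matter, as long as the row just before each 09:15:00 row is the previous session's last row.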