Faster way to add a row in between a time series Python
Question:
I have a dataframe in which one of the columns is 'date'. It contains datetime values in the format 2020-11-04 09:15:00+05:30 for 45 days. The data for each day starts at 09:15:00 and ends at 18:30:00. Apart from the date, there is an x column and a y column.
I want to insert a new row at 09:14:00, just before 09:15:00. The x of the new row should be the y of the previous row, i.e. the 18:30:00 row of the previous day. The y of the new row should be the x of the next row, i.e. the 09:15:00 row of the same day.
I tried the code below, but the result is wrong and it is also very slow.
def add_one_row(df):
    new_df = pd.DataFrame(columns=df.columns)
    for i, row in df.iterrows():
        if i != 0 and row['date'].time() == pd.to_datetime('09:15:00').time():
            new_row = row.copy()
            new_row['date'] = new_row['date'].replace(minute=14, second=0)
            new_row['x'] = df.loc[i-1, 'y']
            new_row['y'] = row['x']
            new_df = pd.concat([new_df, new_row])
        new_df = pd.concat([new_df, row])
    return new_df
I expected row and new_row to be concatenated as rows of new_df. However, it creates a column named 0:
     | date | x | y | 0
date |      |   |   | 2020-11-04 09:15:00+05:30
x    |      |   |   | 50
y    |      |   |   | 60
It is also really slow, so I need a faster approach, perhaps using vectorization.
Can someone provide a faster and correct way to solve this?
Answers:
If you set the date as the index, you can use .at_time() to easily access the 09:15:00 rows. From there you can look back for the previous 18:30:00 rows and then take their y values.
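new_df = df.set_index("date")
# 09:15 rows, skipping the first day; copy so the edits below do not act on a slice of new_df
start_rows = new_df.at_time("09:15:00").tail(-1).copy()
# `18:30` rows of the previous day (09:15 minus 14 h 45 min)
end_rows = new_df.loc[
    start_rows.index - pd.DateOffset(hours=14, minutes=45)
]
# change to `09:14:00`
start_rows.index -= pd.DateOffset(minutes=1)
# y of the new row is the x of the 09:15 row
start_rows["y"] = start_rows["x"]
# x of the new row is the y of the previous 18:30 row
start_rows.update(end_rows["y"].rename("x").set_axis(start_rows.index))
new_df = pd.concat([new_df, start_rows]).sort_index().reset_index()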
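To make the two building blocks of this answer easier to follow, here is a minimal sketch of what .at_time() selects and how the 14 h 45 min look-back lands on the previous day's 18:30:00 row. The sample values are the small two-day frame used elsewhere in this thread, and the variable names (indexed, opens, closes, prev_close_y) are only illustrative. It uses pd.Timedelta for the subtraction instead of pd.DateOffset, which should behave the same for this fixed-offset timezone, and it assumes every previous day actually has an 18:30:00 row.

```python
import pandas as pd

# Small two-day sample; the real data is assumed to have the same layout.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-11-04 09:15:00+05:30", "2020-11-04 12:00:00+05:30",
        "2020-11-04 18:30:00+05:30", "2020-11-05 09:15:00+05:30",
        "2020-11-05 12:00:00+05:30", "2020-11-05 18:30:00+05:30",
    ]),
    "x": [10, 11, 12, 13, 14, 15],
    "y": [20, 21, 22, 23, 24, 25],
})

indexed = df.set_index("date")

# at_time() selects rows by time of day, regardless of the calendar date.
opens = indexed.at_time("09:15:00")
closes = indexed.at_time("18:30:00")

# Going back 14 h 45 min from 09:15 lands on 18:30 of the previous day.
# The first 09:15 row is skipped because it has no previous day.
prev_close_ts = opens.index[1:] - pd.Timedelta(hours=14, minutes=45)
prev_close_y = indexed.loc[prev_close_ts, "y"]

print(opens)
print(prev_close_y)
```

If a day is missing its 18:30:00 row, the .loc lookup raises a KeyError, so the shift-based approach may be the safer choice for irregular data.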
IIUC, you can use the following vectorized code:
from datetime import time
import pandas as pd

# Convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'])

# Compute a new dataframe with right values and keep only 09:14 rows
df914 = (df.assign(x=df['y'].shift(fill_value=0), y=df['x'],
                   date=df['date'] - pd.DateOffset(minutes=1))
           .loc[lambda x: x['date'].dt.time == time(9, 14)])

# Concatenate both dataframes and reorder index
out = pd.concat([df914, df], axis=0).sort_index(kind='stable', ignore_index=True)
Output:
>>> out
                       date   x   y
0 2020-11-04 09:14:00+05:30   0  10  # Added
1 2020-11-04 09:15:00+05:30  10  20
2 2020-11-04 12:00:00+05:30  11  21
3 2020-11-04 18:30:00+05:30  12  22
4 2020-11-05 09:14:00+05:30  22  13  # Added
5 2020-11-05 09:15:00+05:30  13  23
6 2020-11-05 12:00:00+05:30  14  24
7 2020-11-05 18:30:00+05:30  15  25
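Note that the first added row has x = 0, which comes from fill_value=0 because the first day has no previous row. The loop in the question skips the first row instead (i != 0). If that behaviour is wanted, one possible variant is sketched below. It is an assumption about the desired result, not part of the answer above, and the names prev_y, mask and new_rows are only illustrative.

```python
from datetime import time

import pandas as pd

# Same two-day sample as in the output above.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-11-04 09:15:00+05:30", "2020-11-04 12:00:00+05:30",
        "2020-11-04 18:30:00+05:30", "2020-11-05 09:15:00+05:30",
        "2020-11-05 12:00:00+05:30", "2020-11-05 18:30:00+05:30",
    ]),
    "x": [10, 11, 12, 13, 14, 15],
    "y": [20, 21, 22, 23, 24, 25],
})

# y of the previous row; NaN for the very first row of the frame.
prev_y = df["y"].shift()

# 09:15 rows that actually have a previous row.
mask = (df["date"].dt.time == time(9, 15)) & prev_y.notna()

# Build the 09:14 rows; they keep the index labels of their 09:15 rows.
new_rows = pd.DataFrame({
    "date": df.loc[mask, "date"] - pd.Timedelta(minutes=1),
    "x": prev_y[mask].astype(df["y"].dtype),
    "y": df.loc[mask, "x"],
})

# A stable sort on the index places each new row just before its 09:15 row.
out = pd.concat([new_rows, df]).sort_index(kind="stable", ignore_index=True)
print(out)
```

With this variant the sample produces seven rows rather than eight, since no 09:14:00 row is created for the first day.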
Input data for the example output:
import pandas as pd

data = {'date': ['2020-11-04 09:15:00+05:30', '2020-11-04 12:00:00+05:30',
                 '2020-11-04 18:30:00+05:30', '2020-11-05 09:15:00+05:30',
                 '2020-11-05 12:00:00+05:30', '2020-11-05 18:30:00+05:30'],
        'x': [10, 11, 12, 13, 14, 15], 'y': [20, 21, 22, 23, 24, 25]}
df = pd.DataFrame(data)
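On the column named 0 from the question: as far as I can tell, this happens because pd.concat treats a Series passed along the row axis as a single column, so its labels (date, x, y) become row labels and the unnamed Series becomes a column called 0. A minimal sketch of the effect and two common ways around it follows. The values in base and row are made up for illustration.

```python
import pandas as pd

# Hypothetical one-row frame and one row as a Series, as iterrows() would yield it.
base = pd.DataFrame({"date": ["2020-11-03 18:30:00+05:30"], "x": [40], "y": [45]})
row = pd.Series({"date": "2020-11-04 09:15:00+05:30", "x": 50, "y": 60})

# Concatenating the Series directly adds it as a column named 0.
as_column = pd.concat([base, row])

# Turning the Series into a one-row frame first keeps the columns aligned.
as_row = pd.concat([base, row.to_frame().T], ignore_index=True)

# When many rows are involved, collect them in a list and build the frame
# once; calling concat inside a loop copies the data on every iteration.
built_once = pd.DataFrame([row, row]).reset_index(drop=True)

print(as_column.columns.tolist())
print(as_row)
```

This only fixes the shape of the result. The loop is still slow for 45 days of minute data, so the vectorized answers above remain the better option.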