Pandas: easier way to sample interpolated time series data at given times (e.g. every full day)
Question:
Regularly I run into the problem that I have time series data that I want to interpolate and resample at given times. I have a solution, but it feels too labor intensive, i.e. I suspect there should be a simpler way. Have a look at how I currently do it here: https://gist.github.com/cs224/012f393d5ced6931ae223e6ddc4fe6b2 (or the nicer version via nbviewer here: https://nbviewer.org/gist/cs224/012f393d5ced6931ae223e6ddc4fe6b2)
A motivating example: I fill up my car about every two weeks, and I have the cost data of every refill. Now I would like to know the cumulative sum on a daily basis, where the day values are at midnight and interpolated.
Currently I create a new empty data frame that contains the time points at which I want to have my resampled values:
df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))
And then either use pd.merge:
ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
or pd.concat:
ldf = pd.concat([df_in, df_sampling], axis=1)
to create a combined time series that has the additional time points in the index. Based on that I can then use pd.interpolate and then sub-select all index values given by df_sampling. See the gist for details.
All this feels too cumbersome, and I suspect there is a better way to do it.
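For reference, the merge-based approach described above can be sketched end to end as follows. This is a hypothetical reconstruction, not the gist's actual code: the function name, the `'cumsum'` column, and the `ceil('D')` / `method='index'` choices are illustrative assumptions.

```python
import pandas as pd

def sample_daily_via_merge(df_in):
    # Hypothetical sketch of the merge-based approach; names are illustrative.
    start = df_in.index[0].ceil('D')       # first midnight after the first record
    end = df_in.index[-1]
    df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq='D'))
    # Outer-merge the empty sampling frame in, so its timestamps join the index.
    ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
    ldf = ldf.interpolate(method='index')  # 'index' weights by elapsed time
    return ldf.loc[df_sampling.index]      # sub-select only the sampling timestamps

# Two refills two weeks apart; cumulative cost so far at each refill.
df = pd.DataFrame({'cumsum': [21.9, 47.5]},
                  index=pd.to_datetime(['2022-10-07 11:06', '2022-10-21 05:00']))
daily = sample_daily_via_merge(df)
print(daily)
```

Note the two extra objects this needs (the empty `df_sampling` frame and the merged intermediate), which is the overhead the answer below avoids.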
Answers:
Instead of using either merge or concat inside your function generate_interpolated_time_series, I would rely on df.reindex. Something like this:
def f(df_in, freq='T', start=None):
    if start is None:
        start = df_in.index[0].floor('T')
        # refactored: df_in.index[0].replace(second=0, microsecond=0, nanosecond=0)
    end = df_in.index[-1]
    idx = pd.date_range(start=start, end=end, freq=freq)
    ldf = df_in.reindex(df_in.index.union(idx)).interpolate().bfill()
    ldf = ldf[~ldf.index.isin(df_in.index.difference(idx))]
    return ldf
Test sample:
from pandas import Timestamp
d = {Timestamp('2022-10-07 11:06:09.957000'): 21.9,
     Timestamp('2022-11-19 04:53:18.532000'): 47.5,
     Timestamp('2022-11-19 16:30:04.564000'): 66.9,
     Timestamp('2022-11-21 04:17:57.832000'): 96.9,
     Timestamp('2022-12-05 22:26:48.354000'): 118.6}
df = pd.DataFrame.from_dict(d, orient='index', columns=['values'])
print(df)
values
2022-10-07 11:06:09.957 21.9
2022-11-19 04:53:18.532 47.5
2022-11-19 16:30:04.564 66.9
2022-11-21 04:17:57.832 96.9
2022-12-05 22:26:48.354 118.6
Check for equality:
merge = generate_interpolated_time_series(df, freq='D', method='merge')
concat = generate_interpolated_time_series(df, freq='D', method='concat')
reindex = f(df, freq='D')
print(all([merge.equals(concat),merge.equals(reindex)]))
# True
An added bonus is some performance gain. Here you see the results of a comparison between the 3 methods (applying %timeit) for different frequencies (['D','H','T','S']); reindex, in green, is fastest for each.
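The timing plot itself is not reproduced here, but such a comparison could be set up roughly as below. This is a minimal sketch under stated assumptions: the data is synthetic, the reindex-based f is repeated for self-containment, and only the faster frequencies are timed to keep the run short.

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic irregular series: 50 observations scattered over one year,
# with monotonically increasing cumulative values (like the refill costs).
rng = np.random.default_rng(0)
offsets = np.sort(rng.uniform(0, 365 * 24 * 3600, 50))
idx = pd.to_datetime('2022-01-01') + pd.to_timedelta(offsets, unit='s')
df = pd.DataFrame({'values': np.cumsum(rng.uniform(20, 60, 50))}, index=idx)

def f(df_in, freq='T'):
    # reindex-based resampling, repeated from the answer above
    start = df_in.index[0].floor(freq)
    grid = pd.date_range(start=start, end=df_in.index[-1], freq=freq)
    ldf = df_in.reindex(df_in.index.union(grid)).interpolate().bfill()
    return ldf[~ldf.index.isin(df_in.index.difference(grid))]

for freq in ['D', 'H', 'T']:
    t = timeit.timeit(lambda: f(df, freq=freq), number=3) / 3
    print(f'freq={freq}: {t:.4f} s per call')
```

Swapping the body of f for the merge- or concat-based variants and re-running the loop would reproduce the comparison.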
Aside: in your function, raise Exception('Method unknown: ' + metnhod) contains a typo; it should be method.