Pandas: easier way to sample interpolated time series data at given times (e.g. every full day)

Question:

I regularly run into the problem that I have time series data that I want to interpolate and resample at given times. I have a solution, but it feels too labor intensive, i.e. I guess there should be a simpler way. You can see how I currently do it here: https://gist.github.com/cs224/012f393d5ced6931ae223e6ddc4fe6b2 (or the nicer version via nbviewer here: https://nbviewer.org/gist/cs224/012f393d5ced6931ae223e6ddc4fe6b2)

Perhaps a motivating example: I fill up my car about every two weeks, and I have the cost data for every refill. Now I would like to know the cumulative sum on a daily basis, where the daily values are taken at midnight and interpolated.

Currently I create a new empty data frame that contains the time points at which I want to have my resampled values:

df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))

And then either use pd.merge:

ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')

or pd.concat:

ldf = pd.concat([df_in, df_sampling], axis=1)

to create a combined time series that has the additional time points in the index. Based on that I can then use DataFrame.interpolate and sub-select all index values given by df_sampling. See the gist for details.
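The steps above can be sketched roughly as follows, using the refueling example. This is only a minimal illustration, not the code from the gist: the helper name daily_values_via_merge, the column name cost, and the two refill dates are made up. The choice of method="time" is also an assumption; it weights values by elapsed time, whereas the default "linear" treats rows as equally spaced.

```python
import pandas as pd

def daily_values_via_merge(df_in, freq="D"):
    """Hypothetical helper sketching the merge approach (not the gist's code)."""
    # Sampling grid, starting at the first full period after the data begins.
    start = df_in.index[0].ceil(freq)
    grid = pd.date_range(start, df_in.index[-1], freq=freq)
    df_sampling = pd.DataFrame(index=grid)
    # Outer merge on the index so the grid times join the original times.
    ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how="outer")
    # Assumption: interpolate by elapsed time rather than by row position.
    ldf = ldf.interpolate(method="time")
    # Keep only the rows at the sampling times.
    return ldf.loc[grid]

# Made-up data: two refills, 48 hours apart.
refills = pd.DataFrame(
    {"cost": [10.0, 20.0]},
    index=pd.to_datetime(["2022-01-01 12:00", "2022-01-03 12:00"]),
)
daily = daily_values_via_merge(refills.cumsum())
print(daily)
```

For these two made-up refills, the cumulative cost at the two midnights in between comes out as 15.0 and 25.0.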

All this feels too cumbersome, and I guess there should be a better way to do it.

Asked By: cs224

||

Answers:

Instead of using either merge or concat inside your function generate_interpolated_time_series, I would rely on df.reindex. Something like this:

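def f(df_in, freq='min', start=None):
    # 'min' is the current spelling of the minute alias 'T'
    if start is None:
        # round the first timestamp down to the full minute; replaces
        # df_in.index[0].replace(second=0,microsecond=0,nanosecond=0)
        start = df_in.index[0].floor('min')
    end = df_in.index[-1]
    idx = pd.date_range(start=start, end=end, freq=freq)
    # add the sampling times to the index, interpolate, back-fill leading NaNs
    ldf = df_in.reindex(df_in.index.union(idx)).interpolate().bfill()
    # drop the original timestamps that are not sampling times
    ldf = ldf[~ldf.index.isin(df_in.index.difference(idx))]
    return ldf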
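As a side note, the last filtering step in f should be equivalent to selecting the rows at idx directly, since the combined index is the union of the original timestamps and idx. The sketch below shows that variant. The helper name sample_at and its method parameter are hypothetical. The default method="time" is an assumption that weights by elapsed time; passing method="linear" reproduces the default interpolation used in f.

```python
import pandas as pd

def sample_at(df_in, idx, method="time"):
    """Hypothetical helper: interpolate df_in and return only the rows at idx."""
    # Extend the index with the sampling times (new rows hold NaN).
    combined = df_in.reindex(df_in.index.union(idx))
    # bfill covers sampling times that precede the first observation.
    return combined.interpolate(method=method).bfill().loc[idx]

# First three rows of the test sample used in this answer.
df = pd.DataFrame(
    {"values": [21.9, 47.5, 66.9]},
    index=pd.to_datetime(
        [
            "2022-10-07 11:06:09.957",
            "2022-11-19 04:53:18.532",
            "2022-11-19 16:30:04.564",
        ]
    ),
)
idx = pd.date_range(df.index[0].floor("D"), df.index[-1], freq="D")
sampled = sample_at(df, idx, method="linear")
print(sampled.head())
```

This assumes the index of df_in has no duplicate timestamps, since reindex requires a unique index.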

Test sample:

import pandas as pd
from pandas import Timestamp

d = {Timestamp('2022-10-07 11:06:09.957000'): 21.9,
 Timestamp('2022-11-19 04:53:18.532000'): 47.5,
 Timestamp('2022-11-19 16:30:04.564000'): 66.9,
 Timestamp('2022-11-21 04:17:57.832000'): 96.9,
 Timestamp('2022-12-05 22:26:48.354000'): 118.6}

df = pd.DataFrame.from_dict(d, orient='index', columns=['values'])

print(df)

                         values
2022-10-07 11:06:09.957    21.9
2022-11-19 04:53:18.532    47.5
2022-11-19 16:30:04.564    66.9
2022-11-21 04:17:57.832    96.9
2022-12-05 22:26:48.354   118.6

Check for equality:

merge = generate_interpolated_time_series(df, freq='D', method='merge')
concat = generate_interpolated_time_series(df, freq='D', method='concat')
reindex = f(df, freq='D')

print(all([merge.equals(concat),merge.equals(reindex)]))
# True

An added bonus is some performance gain. Here are the results of a comparison between the three methods (using %timeit) for different frequencies (['D','H','T','S']); reindex, in green, is the fastest for each.

[Figure: comparison]

Aside: in your function, raise Exception('Method unknown: ' + metnhod) contains a typo; should be method.

Answered By: ouroboros1