Pandas.resample to a non-integer multiple frequency

Question

I have to resample my dataset from a 10-minute interval to a 15-minute interval to bring it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas on how to proceed, but none of them delivers a clean and clear solution.

Problem

Problem set up

#%% Import modules 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)


#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y

Desired output

Now I want the values that already fall on the 15-minute grid (**:00 and **:30) to be unchanged, and the values at **:15 and **:45 to be the mean of **:10 and **:20, and of **:40 and **:50, respectively. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('15Min').mean() would have worked.
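
As a minimal sketch of these target values (not part of the original post), the 10-minute series can be reindexed onto the union of both grids, interpolated linearly in time, and then restricted to the 15-minute stamps. The names idx10, idx15 and y15 are made up for this illustration, and 'min' is assumed as the frequency alias:

```python
import numpy as np
import pandas as pd

periods = 12
idx10 = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = pd.Series(-(np.arange(periods) - periods / 2) ** 2, index=idx10, name='y')

# 15-minute stamps that lie inside the range covered by the data
idx15 = pd.date_range(idx10[0], idx10[-1], freq='15min')

# put both grids in one index, interpolate by elapsed time,
# then keep only the 15-minute stamps
y15 = y.reindex(idx10.union(idx15)).interpolate(method='time').reindex(idx15)
print(y15)
```

At **:15 and **:45 this gives the mean of the two neighbouring 10-minute values, because those stamps lie exactly halfway between them; the values at **:00 and **:30 are passed through unchanged.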

Possible solutions

1. Simply use the 15-minute resampling and live with the small error it introduces.
2. Use two forms of resample, one with closed='left', label='left' and one with closed='right', label='right', and then average the two results. This still introduces some error, but less than the first method.
3. Resample everything to 5-minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval
4. Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array. For this I would have to create a new Series of weights with varying length, where the weights alternate between 1 and 2.
5. Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation
   Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!

None of these methods is really satisfying to me. Some lead to a small error, and others would be quite difficult for an outsider to read.

Implementation

The implementation of methods 1, 2 and 5, together with the desired output and a visualization.

#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')

#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label        
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
        
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label        
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)

#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')

#%% desired output
ydesired = np.zeros(periods // 3 * 2)  # integer division: np.zeros needs an int size
i = 0 
j = 0 
k = 0 
for val in ydesired:
    if i+k==len(y): k=0
    ydesired[j] = np.mean([y[i],y[i+k]]) 
    j+=1
    i+=1
    if k==0: k=1; 
    else: k=0; i+=1
plt.plot(dfresamplell.index, ydesired, label='ydesired')


#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15min').first().head()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')


#%% finalize plot
plt.legend()

Implementation for angles

As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted to angles again. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.

#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)

#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)

#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15min').first()

#%% convert complex to degrees
def f(x):
    # combine the real and imaginary parts and return the angle in degrees
    return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)

#%% set all the values between 0-360 degrees
# wrap only the 'degrees' column; the other columns stay untouched
dfresample['degrees'] = dfresample['degrees'] % 360

#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360

#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according @Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
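
The 345 and 5 degree case mentioned above can be checked in isolation. The following is a small sketch, with circular_mean_deg as a hypothetical helper name that does not appear in the code above: each angle is mapped to a unit complex number, the complex numbers are averaged, and the angle of the result is taken.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    # map every angle to a point on the unit circle in the complex plane
    z = np.exp(1j * np.deg2rad(np.asarray(angles_deg, dtype=float)))
    # the angle of the averaged point is the mean direction
    return np.angle(z.mean(), deg=True) % 360

naive = np.mean([345, 5])               # plain arithmetic mean
circular = circular_mean_deg([345, 5])  # mean direction across the 0/360 wrap
print(naive, circular)
```

Here the arithmetic mean is 175 degrees, which points in the opposite direction, while the mean direction is 355 degrees up to floating-point rounding.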

Asked By: Hennep

Answer 1

Ok, here’s one way to do it.

1. Make a list of the times you want to have filled in
2. Make a combined index that includes the times you want and the times you already have
3. Take your data and “forward fill it”
4. Take your data and “backward fill it”
5. Average the forward and backward fills
6. Select only the rows you want

Note this only works since you want the values exactly halfway between the values you already have, time-wise. Note the last time comes out np.nan because you don’t have any later data.

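import datetime as dt
import pandas as pd

# 15-minute timestamps to fill in, starting one step after the first sample
times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)

# union of the wanted times and the existing times (the result is sorted)
combined = df.index.union(pd.DatetimeIndex(times_15))
df = df.reindex(combined)
df['ff'] = df['y'].ffill()   # last known value before each timestamp
df['bf'] = df['y'].bfill()   # next known value after each timestamp
df['solution'] = df[['ff', 'bf']].mean(axis=1)
df.loc[times_15, :]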

Answered By: 8one6

Answer 2

I might be misunderstanding the problem, but does this work?

TL;DR version:

import numpy as np
import pandas

data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])

Long version

Set up fake data

import numpy as np
import pandas

data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)


                      A
2012-01-01 00:00:00   0
2012-01-01 00:10:00   8
2012-01-01 00:20:00  16
2012-01-01 00:30:00  24
2012-01-01 00:40:00  32
2012-01-01 00:50:00  40
2012-01-01 01:00:00  48
2012-01-01 01:10:00  56
2012-01-01 01:20:00  64
2012-01-01 01:30:00  72
2012-01-01 01:40:00  80
2012-01-01 01:50:00  88
2012-01-01 02:00:00  96

So then build a new 5-minute index and reindex the original dataframe

index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
df2 = df1.reindex(index=index_05T)
print(df2)

                      A
2012-01-01 00:00:00   0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00   8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00  16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00  24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00  32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00  40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00  48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00  56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00  64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00  72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00  80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00  88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00  96

and then linearly interpolate

print(df2.interpolate())
                      A
2012-01-01 00:00:00   0
2012-01-01 00:05:00   4
2012-01-01 00:10:00   8
2012-01-01 00:15:00  12
2012-01-01 00:20:00  16
2012-01-01 00:25:00  20
2012-01-01 00:30:00  24
2012-01-01 00:35:00  28
2012-01-01 00:40:00  32
2012-01-01 00:45:00  36
2012-01-01 00:50:00  40
2012-01-01 00:55:00  44
2012-01-01 01:00:00  48
2012-01-01 01:05:00  52
2012-01-01 01:10:00  56
2012-01-01 01:15:00  60
2012-01-01 01:20:00  64
2012-01-01 01:25:00  68
2012-01-01 01:30:00  72
2012-01-01 01:35:00  76
2012-01-01 01:40:00  80
2012-01-01 01:45:00  84
2012-01-01 01:50:00  88
2012-01-01 01:55:00  92
2012-01-01 02:00:00  96

build a 15-minute index and use that to pull out data:

index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
print(df2.interpolate().loc[index_15T])

                      A
2012-01-01 00:00:00   0
2012-01-01 00:15:00  12
2012-01-01 00:30:00  24
2012-01-01 00:45:00  36
2012-01-01 01:00:00  48
2012-01-01 01:15:00  60
2012-01-01 01:30:00  72
2012-01-01 01:45:00  84
2012-01-01 02:00:00  96
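
The same three steps can probably be chained with resample and asfreq instead of building the indexes by hand. This is only a sketch of that variant under the same fake data; the names index_10min and df_15min are made up here, and 'min' is assumed as the frequency alias:

```python
import numpy as np
import pandas as pd

data = np.arange(0, 101, 8)
index_10min = pd.date_range('2012-01-01 00:00', freq='10min', periods=data.shape[0])
df1 = pd.DataFrame({'A': data}, index=index_10min)

# upsample to 5 minutes (new rows are NaN), interpolate linearly,
# then keep only the rows that fall on the 15-minute grid
df_15min = df1.resample('5min').asfreq().interpolate().asfreq('15min')
print(df_15min)
```

Because the 5-minute grid contains both the 10-minute and the 15-minute stamps, every 15-minute value is either an original sample or the midpoint of two neighbouring samples.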

Answered By: Paul H

Answer 3

In case someone is working with data without regularity at all, here is an adapted solution from the one provided by Paul H above.

If you don’t want to interpolate throughout the time-series, but only in those places where resample is meaningful, you may keep the interpolated column side by side and finish with a resample and dropna.

import numpy as np
import pandas

data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='01T', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
                      A
2022-01-01 00:03:00   9
2022-01-01 00:06:00  18
2022-01-01 00:08:00  24
2022-01-01 00:18:00  54
2022-01-01 00:25:00  75
2022-01-01 00:27:00  81
2022-01-01 00:30:00  90

Notice that resampling this irregular DF assigns each value to the floor of its 5-minute bin, without interpolating.

print(df1.resample('05T').mean())

                        A
2022-01-01 00:00:00   9.0
2022-01-01 00:05:00  24.0
2022-01-01 00:10:00  39.0
2022-01-01 00:15:00  51.0
2022-01-01 00:20:00   NaN
2022-01-01 00:25:00  79.5

A better solution can be achieved by interpolating on a small enough interval and then resampling. The resulting DF now has too many rows, but a dropna() brings it close to its original shape.

index_1min = pandas.date_range(freq='01T', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('05T').first().dropna())

                        A  A_interp
2022-01-01 00:00:00   9.0       9.0
2022-01-01 00:05:00  21.0      15.0
2022-01-01 00:10:00  39.0      30.0
2022-01-01 00:15:00  51.0      45.0
2022-01-01 00:25:00  75.0      75.0
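
If constant fills before the first sample are not wanted, one possible variation is to interpolate only between existing samples. The sketch below reuses the sample values printed above; limit_area='inside' is a standard interpolate option, while the names s, grid, dense and s_5min are made up for this illustration:

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime([
    '2022-01-01 00:03', '2022-01-01 00:06', '2022-01-01 00:08',
    '2022-01-01 00:18', '2022-01-01 00:25', '2022-01-01 00:27',
    '2022-01-01 00:30',
])
s = pd.Series([9, 18, 24, 54, 75, 81, 90], index=stamps, name='A', dtype=float)

# dense 1-minute grid that covers the irregular samples
grid = pd.date_range('2022-01-01 00:00', '2022-01-01 00:30', freq='1min')

# interpolate only between valid samples; leading NaNs are left as NaN
dense = s.reindex(grid).interpolate(method='time', limit_area='inside')

# keep the 5-minute stamps and drop those without surrounding data
s_5min = dense.asfreq('5min').dropna()
print(s_5min)
```

With this choice the row at 00:00 is dropped instead of being filled with the first observed value, since no sample exists before it.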

Answered By: Bicudo