Pandas Rolling Gradient – Improving/Reducing Computation Time
Question:
I am calculating the rolling slope or gradient of a column in a pandas data frame with a datetime index and looking for suggestions to reduce computation time over the current approach using .rolling and .apply (detailed below).
I have two additional requirements: a minimum number of observations to include in the rolling calculation and a maximum window size (see the example below).
Example: minimum number of points = 3, maximum window size = 7 days
datetime             values  intended_window          gradient
01-01-2010 00:00:00  10      np.nan                   NaN
01-02-2010 00:00:00  11      np.nan                   NaN
01-03-2010 00:00:00  12      [10,11,12]               0.04167
01-04-2010 00:00:00  13      [10,11,12,13]            0.04167
01-05-2010 00:00:00  14      [10,11,12,13,14]         0.04167
01-06-2010 00:00:00  15      [10,11,12,13,14,15]      0.04167
01-07-2010 00:00:00  16      [10,11,12,13,14,15,16]   0.04167
01-08-2010 00:00:00  17      [11,12,13,14,15,16,17]   0.04167
01-09-2010 00:00:00  18      [12,13,14,15,16,17,18]   0.04167
01-10-2010 00:00:00  19      [13,14,15,16,17,18,19]   0.04167
The current approach is effectively:
gradient = df['values'].rolling(window='7d', min_periods=3).apply(get_slope, raw=False)
where
def get_slope(df):
    df = df.dropna()
    min_date = df.index.min()
    x = (df.index - min_date).total_seconds()/60/60
    y = np.array(df)
    slope, intercept, r_value, p_value, std_err = linregress(x,y)
    return slope
Does anyone have a suggestion on how this could be radically sped up? When the maximum window size is increased, the computation time increases significantly. Is there any way to vectorise this calculation?
Answers:
I do not know if I could call it "radical", but I seem to be getting a 10-15% speedup essentially for free by replacing linregress with polyfit in get_slope:
def get_slope_polyfit(df):
    df = df.dropna()
    min_date = df.index.min()
    x = (df.index - min_date).total_seconds()/60/60
    y = np.array(df)
    slope, _ = polyfit(x, y, 1)
    return slope
Moving some calculations outside the loop also seems to give another 5-10% speedup.
from time import time
import pandas as pd
import numpy as np
from scipy.stats import linregress
from numpy import polyfit
from numpy.lib.stride_tricks import sliding_window_view
N = 10000
dti = pd.date_range('2010-01-01', periods=N, freq='D')
values = np.arange(N) *1.0
values[10: 20] = np.nan
df = pd.DataFrame(values, index=dti, columns=['values'])
min_date = df.index.min()
x = (df.index - min_date).total_seconds()/60/60
y = np.array(df.values).T.squeeze()
def get_slope(df):
    df = df.dropna()
    min_date = df.index.min()
    x = (df.index - min_date).total_seconds()/60/60
    y = np.array(df)
    slope, intercept, r_value, p_value, std_err = linregress(x,y)
    return slope

def get_slope_polyfit(df):
    df = df.dropna()
    min_date = df.index.min()
    x = (df.index - min_date).total_seconds()/60/60
    y = np.array(df)
    slope, _ = polyfit(x, y, 1)
    return slope

def get_slope_with_precalculations(dfi):
    x = dfi['timedelta']
    y = dfi['values']
    x = x[~np.isnan(y)]
    y = y[~np.isnan(y)]
    if x.size < 2:
        return np.nan
    slope, _ = polyfit(x, y, 1)
    return slope

print('original calculation')
begin = time()
gradient = df['values'].rolling(window='7d', min_periods=3).apply(get_slope, raw=False)
end = time()
print(f"execution time {end - begin}")
print('get_slope_polyfit calculation')
begin = time()
gradient_polyfit = df['values'].rolling(window='7d', min_periods=3).apply(get_slope_polyfit, raw=False)
end = time()
print(f"execution time {end - begin}")
print(f"with pre-calculations")
begin = time()
min_date = df.index.min()
df['timedelta'] = (df.index - min_date).total_seconds()/60/60
gradient_precalculations = np.array([get_slope_with_precalculations(dfi) for dfi in df.rolling(window='7d', min_periods=3)])
end = time()
print(f"execution time {end - begin}")
Output:
> original calculation
> execution time 9.473661422729492
> get_slope_polyfit calculation
> execution time 8.135330200195312
> with pre-calculations
> execution time 7.553420305252075
Okay, here are my first results (I managed to get a ~7x improvement). However, I'm pretty sure that if you assume no NaNs, you can get a ~100x to 1000x speed improvement, but that's for another time. (Update: see the edit below.)
Profiling the get_slope function reveals the 3 bottlenecks:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           def get_slope(df):
    15       998    2374316.0   2379.1     18.6      df = df.dropna()
    16       998     494087.0    495.1      3.9      min_date = df.index.min()
    17       998    6298961.0   6311.6     49.4      x = (df.index - min_date).total_seconds()/3600
    18       998     131353.0    131.6      1.0      y = np.array(df)
    19       998    3447066.0   3454.0     27.0      slope, intercept, r_value, p_value, std_err = linregress(x, y)
    20       998       8157.0      8.2      0.1      return slope
As we can see, dropna, the creation of x, and the slope calculation are what take the time. There is no easy solution to the dropna problem, but the other two slow steps can be removed. The slope computation is a least-squares fit, which, as noted by @zap, can be slightly improved by using polyfit, but it can be accelerated even more if we hard-code it:
def get_slope2(df):
    df = df.dropna() # takes 24.5% of the time
    min_date = df.index.min() # takes 4.5% of the time
    x = (df.index - min_date).total_seconds()/3600 # takes 70% of the time
    x = x.to_numpy()
    y = df.to_numpy()
    n = len(x)
    xsum = x.sum()/n
    ysum = y.sum()/n
    xx = x.dot(x)/n
    xy = x.dot(y)/n
    den = xx - xsum*xsum
    slope = (xy - xsum * ysum)/den
    return slope
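The hard-coded expression in get_slope2 is the usual closed form for a simple least-squares slope, cov(x, y) / var(x), written with means. As a quick sanity check, here is a minimal sketch that compares it with np.polyfit; the function name closed_form_slope and the sample data are made up for illustration and are not part of the answer above.

```python
import numpy as np

def closed_form_slope(x, y):
    """Least-squares slope of y on x, written with means as in get_slope2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # cov(x, y) / var(x), both with the 1/n convention (the n cancels out)
    return ((x * y).mean() - x_mean * y_mean) / ((x * x).mean() - x_mean**2)

# Illustrative data: 7 daily samples (x in hours) with a little noise.
rng = np.random.default_rng(0)
x = np.arange(7) * 24.0
y = 0.5 * x + rng.normal(size=x.size)

# The two values should agree to floating-point precision.
print(closed_form_slope(x, y), np.polyfit(x, y, 1)[0])
```

The speedup comes from skipping everything linregress computes besides the slope (intercept, r-value, p-value, standard error).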
This version is already ~1.5x faster. To solve the problem of computing x, the solution is to do the conversion from datetimes to hours only once, for the whole array, and use the resulting numbers as the index. The slope function would look like
def get_slope3(df2):
    df2 = df2.dropna()
    x = df2.index.to_numpy()
    # subtract out of place so the array backing the index is never modified
    x = x - x.min()
    y = df2.to_numpy()
    n = len(x)
    xsum = x.sum()/n
    ysum = y.sum()/n
    xx = x.dot(x)/n
    xy = x.dot(y)/n
    den = xx - xsum*xsum
    slope = (xy - xsum * ysum)/den
    return slope
with the new dataframe being
min_date = df.index.min()
df2 = df.set_index((df.index - min_date).total_seconds()/3600)
With 10000 elements, I get the following timings:
original : time = 6919.07 ms
get_slope2 : time = 4542.78 ms
get_slope3 : time = 942.982 ms
And commenting out the dropna adds an additional 2x speedup.
A further optimization would be to compute everything at once. If there are no NaNs, we can compute the sums as differences of a global cumsum, which would be extremely fast, giving O(n) time (with a small constant) regardless of the window size. If there are NaNs, this approach could still be used by interpolating the values for the NaNs and then recomputing the gradients around the interpolated values by more traditional means, but this gets a bit complicated (but since you said radically...).
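A minimal sketch of the cumsum idea for the no-NaN case, assuming a sorted datetime index and the same window convention as pandas' default for offset windows, (t - window, t]. The function name rolling_slope_cumsum and its parameters are hypothetical, and np.searchsorted is my choice for locating the window starts; it is not taken from the answer above.

```python
import numpy as np
import pandas as pd

def rolling_slope_cumsum(series, window_hours=7 * 24, min_periods=3):
    """Rolling least-squares slope (per hour) from prefix sums; assumes no NaNs."""
    # Hours since the first timestamp; the index must be sorted.
    x = (series.index - series.index[0]).total_seconds().to_numpy() / 3600.0
    y = series.to_numpy(dtype=float)
    # Centre both arrays to limit cancellation error in the sums below.
    x = x - x.mean()
    y = y - y.mean()

    def prefix(a):
        # Leading zero so that sum(a[i:j]) == p[j] - p[i].
        return np.concatenate(([0.0], np.cumsum(a)))

    cx, cy, cxx, cxy = prefix(x), prefix(y), prefix(x * x), prefix(x * y)

    # Row i uses rows start[i]..i, i.e. timestamps in (t_i - window, t_i].
    start = np.searchsorted(x, x - window_hours, side="right")
    stop = np.arange(x.size) + 1
    n = stop - start

    sx, sy = cx[stop] - cx[start], cy[stop] - cy[start]
    sxx, sxy = cxx[stop] - cxx[start], cxy[stop] - cxy[start]

    num = n * sxy - sx * sy
    den = n * sxx - sx * sx
    slope = np.full(x.size, np.nan)
    ok = (n >= min_periods) & (den > 0)
    slope[ok] = num[ok] / den[ok]
    return slope

dti = pd.date_range("2010-01-01", periods=10, freq="D")
s = pd.Series(np.arange(10, 20, dtype=float), index=dti)
print(rolling_slope_cumsum(s))
```

Every step is a whole-array operation, so the cost does not grow with the window size. For very long series the differences of large prefix sums can lose precision, which is why the sketch centres x and y first.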
Edit : getting a 1000x speedup (+ solving the window problem)
The idea here is to compute everything in a handful of (numpy) function calls. To do so, we need to know the local x and y for each row, and thus which data points to use based on the window size (given in hours here). The number of data points in each window is computed by the function
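import numpy as np
from numba import njit
import warnings

@njit
def getwinsize(x, win, min_periods):
    m = 0                                # largest window size seen so far
    n = x.size
    out = np.empty(n, dtype=np.int32)
    i = 0                                # right edge of the window
    j = 0                                # left edge of the window
    while(i < n):
        if x[j] + win > x[i]:
            # x[j] is inside the window ending at x[i]
            out[i] = i-j+1 if i-j+1 >= min_periods else -1
            m = m if m > out[i] else out[i]
            i += 1
        else:
            j += 1
    return out, m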
Using the njit decorator from numba is not necessary, but it certainly helps, especially when the inputs are large. The slope-computing function is
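If numba is not available, the same counts can also be obtained without an explicit Python loop. The following is a sketch under the assumption that x is sorted; window_counts is a hypothetical name, and np.searchsorted stands in for the two-pointer loop of getwinsize. Because it tests x[j] > x[i] - win rather than x[j] + win > x[i], results could differ by rounding for timestamps that sit exactly on a window boundary.

```python
import numpy as np

def window_counts(hours, win, min_periods):
    """Per-row window size (or -1 if below min_periods) and the largest size."""
    hours = np.asarray(hours, dtype=float)
    # First index j with hours[j] > hours[i] - win, for every i at once.
    start = np.searchsorted(hours, hours - win, side="right")
    counts = np.arange(hours.size) - start + 1
    out = np.where(counts >= min_periods, counts, -1).astype(np.int32)
    largest = max(int(out.max()), 0) if out.size else 0
    return out, largest

# Ten daily samples, 7-day window, at least 3 points (the question's example).
sizes, biggest = window_counts(np.arange(10) * 24.0, 7 * 24, 3)
print(sizes, biggest)
```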
def get_slope4(df, winsizeInHours=7*24, min_periods=3):
    # hours since the first timestamp, computed once for the whole frame
    hours = (df.index - df.index.min()).total_seconds().to_numpy()/3600
    y = df.to_numpy().ravel()
    N = len(hours)
    locwinsize, maxwinsize = getwinsize(hours, winsizeInHours, min_periods)
    # column i holds the series shifted down by i rows, so row r contains
    # the points r, r-1, ..., r-maxwinsize+1
    X = np.empty((N, maxwinsize))
    Y = np.empty((N, maxwinsize))
    for i in range(maxwinsize):
        X[i:,i] = hours[:N-i]
        Y[i:,i] = y[:N-i]
    # mask NaN values and every entry that lies outside the row's window
    mask = np.isnan(Y)
    for i in range(maxwinsize):
        mask[:, i] = np.logical_or(mask[:, i], locwinsize<=i)
    X[mask] = np.nan
    Y[mask] = np.nan
    XY = X*Y
    XX = X*X
    with warnings.catch_warnings(): # ignore warning for "mean of empty slice"
        warnings.simplefilter("ignore", category=RuntimeWarning)
        Xbar = np.nanmean(X, axis=1)
        Ybar = np.nanmean(Y, axis=1)
        XXbar = np.nanmean(XX, axis=1)
        XYbar = np.nanmean(XY, axis=1)
    den = XXbar - Xbar*Xbar
    slopes = (XYbar - Xbar * Ybar)/den
    return slopes
This code gives the same result as the original one (with window="7d"), but is much faster. Note that the returned value is a numpy array, not a DataFrame.
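Since the result is a plain array, putting it back next to the data takes one extra step. A small sketch, using a stand-in slopes array with made-up values in place of the actual get_slope4 output:

```python
import numpy as np
import pandas as pd

dti = pd.date_range("2010-01-01", periods=5, freq="D")
df = pd.DataFrame({"values": np.arange(10.0, 15.0)}, index=dti)

# Stand-in for get_slope4(df): one slope per row, NaN where the window is too small.
slopes = np.array([np.nan, np.nan, 1 / 24, 1 / 24, 1 / 24])

# Re-attach the datetime index so the result aligns with the original frame.
df["gradient"] = pd.Series(slopes, index=df.index)
print(df)
```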
Here are some timings with 10000 samples:
Initial code : time = 6860.03 ms
get_slope4 without numba : time = 28.27 ms
get_slope4 with numba : time = 5.06 ms
So the non-numba version gives a 240x speed improvement and the numba version gives a >1000x speed bonus, so hopefully that’s good enough.
Using sklearn's LinearRegression, I get a 10x speedup for computing the slope of a time series.
First I set the date as index of my Pandas Series:
df.set_index('datetime', inplace=True)
def get_slope(df):
    import datetime as dt
    from sklearn import linear_model
    # Convert the datetime to ordinal
    date_ordinal = pd.to_datetime(df.index).map(dt.datetime.toordinal)
    # Fit the model
    reg = linear_model.LinearRegression()
    reg.fit(date_ordinal.values.reshape(-1, 1), df.values)
    return reg.coef_[0]
# compute the rolling gradient
df['gradient'] = df.value.rolling(10).apply(get_slope, raw=False)