Pandas rolling regression: alternatives to looping
Question:
I got good use out of pandas’ MovingOLS class (source here) within the deprecated stats/ols module. Unfortunately, it was removed completely in pandas 0.20.
The question of how to run rolling OLS regression efficiently has been asked several times (here, for instance), but phrased a little broadly and left without a great answer, in my view.
Here are my questions:

How can I best mimic the basic framework of pandas’ MovingOLS? The most attractive feature of this class was the ability to view multiple methods/attributes as separate time series (i.e. coefficients, r-squared, t-statistics, etc.) without needing to rerun the regression. For example, you could create something like model = pd.MovingOLS(y, x) and then call .t_stat, .rmse, .std_err, and the like. In the example below, conversely, I don’t see a way around being forced to compute each statistic separately. Is there a method that doesn’t involve creating sliding/rolling “blocks” (strides) and running regressions/using linear algebra to get model parameters for each?
More broadly, what’s going on under the hood in pandas that makes rolling.apply not able to take more complex functions?* When you create a .rolling object, in layman’s terms, what’s going on internally? Is it fundamentally different from looping over each window and creating a higher-dimensional array as I’m doing below?
*Namely, func passed to .apply:
Must produce a single value from an ndarray input. *args and **kwargs are passed to the function.
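To make the constraint in the footnote concrete, here is a minimal sketch on made-up numbers (the series and the helper name window_range are illustrative only). With raw=True, the callback receives each window as a 1-D NumPy array and has to reduce it to one number, which pandas then stores at the window's last position.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])  # toy data, not from the question

def window_range(values):
    # With raw=True, `values` is a 1-D ndarray holding one window.
    # The return value must be a single scalar.
    return values.max() - values.min()

result = s.rolling(window=3).apply(window_range, raw=True)
print(result.tolist())  # [nan, nan, 3.0, 5.0, 7.0]
```

Because only one scalar per window comes back, a fitted model with several attributes cannot be returned this way; each statistic would need its own pass.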
Here’s where I’m currently at with some sample data, regressing percentage changes in the trade-weighted dollar on interest rate spreads and the price of copper. (This doesn’t make a ton of sense; I just picked these randomly.) I’ve taken it out of a class-based implementation and tried to strip it down to a simpler script.
from datetime import date

import numpy as np
import pandas as pd
from pandas_datareader.data import DataReader
# statsmodels.api exposes the OLS class (formula.api may not in newer releases)
import statsmodels.api as sm

syms = {'TWEXBMTH' : 'usd',
        'T10Y2YM' : 'term_spread',
        'PCOPPUSDM' : 'copper'
       }
start = date(2000, 1, 1)
data = (DataReader(syms.keys(), 'fred', start)
        .pct_change()
        .dropna())
data = data.rename(columns = syms)
data = data.assign(intercept = 1.) # required by statsmodels OLS

def sliding_windows(x, window):
    """Create rolling/sliding windows of length ~window~.

    Given an array of shape (y, z), it will return "blocks" of shape
    (y - window + 1, window, z)."""
    return np.array([x[i:i + window] for i
                     in range(0, x.shape[0] - window + 1)])

data.head(3)
Out[33]:
                 usd  term_spread    copper  intercept
DATE
2000-02-01  0.012573     1.409091  0.019972        1.0
2000-03-01  0.000079     2.000000  0.037202        1.0
2000-04-01  0.005642     0.518519  0.033275        1.0

window = 36
wins = sliding_windows(data.values, window=window)
y, x = wins[:, :, 0], wins[:, :, 1:]  # first column is usd; the rest are regressors

coefs = []
for endog, exog in zip(y, x):
    model = sm.OLS(endog, exog).fit()
    # The full set of model attributes gets lost with each loop
    coefs.append(model.params)

df = pd.DataFrame(coefs, columns=data.iloc[:, 1:].columns,
                  index=data.index[window - 1:])
df.head(3) # rolling 36m coefficients
Out[70]:
            term_spread    copper  intercept
DATE
2003-01-01     0.000122  0.018426   0.001937
2003-02-01     0.000391  0.015740   0.001597
2003-03-01     0.000655  0.016811   0.001546
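One way to avoid losing everything but the coefficients is to collect several statistics in the same pass over the windows. The sketch below is only an illustration on synthetic data: the function name rolling_ols_stats and the choice of statistics are assumptions, it uses NumPy's least-squares solver rather than statsmodels, and it builds the windows with numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20 or later), which returns views instead of copies.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_ols_stats(y, X, window):
    """Fit OLS on each window once and keep several statistics.

    y has shape (n,); X has shape (n, k) and includes an intercept column.
    Returns a dict of arrays, each with n - window + 1 rows.
    """
    # sliding_window_view appends the window axis last: (m, k, window).
    X_wins = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)
    y_wins = sliding_window_view(y, window)
    out = {"params": [], "rsquared": [], "rmse": []}
    for yw, Xw in zip(y_wins, X_wins):
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        resid = yw - Xw @ beta
        ss_res = resid @ resid
        ss_tot = ((yw - yw.mean()) ** 2).sum()
        out["params"].append(beta)
        out["rsquared"].append(1.0 - ss_res / ss_tot)
        # Degrees-of-freedom adjusted root mean squared error
        out["rmse"].append(np.sqrt(ss_res / (window - Xw.shape[1])))
    return {name: np.asarray(vals) for name, vals in out.items()}

# Synthetic data: two regressors plus an intercept column
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y = X @ np.array([2.0, -3.0, 1.0]) + rng.normal(scale=0.1, size=n)

stats = rolling_ols_stats(y, X, window=36)
print(stats["params"].shape)  # (25, 3)
```

Each entry of the returned dict can then be wrapped in a DataFrame or Series indexed by data.index[window - 1:], as in the script above.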
Answers:
I created an ols module designed to mimic pandas’ deprecated MovingOLS; it is here.
It has three core classes:
OLS: static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
RollingOLS: rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimension NumPy arrays.
PandasRollingOLS: wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.
Note that the module is part of a package (which I’m currently in the process of uploading to PyPI) and it requires one inter-package import.
The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also takes advantage of broadcasting extensively. Attributes largely mimic statsmodels’ OLS RegressionResultsWrapper.
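To give a feel for what "matrix algebra plus broadcasting" can look like, here is a rough sketch that solves the normal equations for every window in one batched call. It is not the module's actual code: the function name rolling_beta_vectorized and the synthetic data are made up for illustration, and it assumes NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_beta_vectorized(y, X, window):
    """Solve (X'X) b = X'y for all windows at once, with no Python loop.

    y has shape (n,); X has shape (n, k) and includes an intercept column.
    Returns an array of shape (n - window + 1, k).
    """
    Xw = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)  # (m, window, k)
    yw = sliding_window_view(y, window)[..., None]                  # (m, window, 1)
    Xt = Xw.transpose(0, 2, 1)                                      # (m, k, window)
    XtX = Xt @ Xw  # batched matrix product, shape (m, k, k)
    Xty = Xt @ yw  # shape (m, k, 1)
    # np.linalg.solve broadcasts over the leading (window) axis
    return np.linalg.solve(XtX, Xty)[..., 0]

rng = np.random.default_rng(1)
n, window = 48, 12
X = np.column_stack([rng.normal(size=n), rng.normal(size=n), np.ones(n)])
y = X @ np.array([0.5, -1.0, 0.2]) + rng.normal(scale=0.05, size=n)

betas = rolling_beta_vectorized(y, X, window)
print(betas.shape)  # (37, 3)
```

Solving the normal equations directly is fast but can be less numerically stable than a QR- or SVD-based solver when regressors are nearly collinear, so treat this as a starting point rather than a drop-in replacement.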
An example:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd",
    "T10Y2YM" : "term_spread",
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE
# 2000-02-01  0.012580     1.409091  0.057152
# 2000-03-01  0.000113     2.000000  0.047034
# 2000-04-01  0.005634     0.518519  0.023520
# 2000-05-01  0.022017     0.097561  0.016675
# 2000-06-01  0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
#             term_spread      gold
# DATE
# 2001-01-01     0.000033  0.054261
# 2001-02-01     0.000277  0.188556
# 2001-03-01     0.002432  0.294865
# 2001-04-01     0.002796  0.334880
# 2001-05-01     0.002448  0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64

Use a custom rolling apply function.
import numpy as np
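# Assumes the series to regress is a column named 'values'. The attribute df.values
# would return a NumPy array, which has no .rolling method.
df['slope'] = df['values'].rolling(window=125).apply(lambda x: np.polyfit(np.array(range(0,125)), x, 1)[0], raw=True)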
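
A small self-contained version of the same idea, for checking the behaviour on data where the answer is known. The column name price, the window of 5, and the helper window_slope are made up for this illustration; the series is an exact line with slope 0.5, so every full window should recover that slope.

```python
import numpy as np
import pandas as pd

window = 5
t = np.arange(window)  # x-coordinates 0..window-1, reused for every window

def window_slope(values):
    # For degree 1, np.polyfit returns [slope, intercept]
    return np.polyfit(t, values, 1)[0]

# Hypothetical data: a straight line with slope 0.5 and intercept 3
df = pd.DataFrame({"price": 0.5 * np.arange(20) + 3.0})
df["slope"] = df["price"].rolling(window=window).apply(window_slope, raw=True)
print(df["slope"].iloc[window - 1:].round(6).unique())
```

Note that this regresses each window on its position within the window, so it yields a time-trend slope only; it does not handle multiple regressors the way the rolling OLS classes above do.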