Cumulative OLS with Python Pandas

Question:

I am using Pandas 0.8.1, and at the moment I can’t change the version. If a newer version will help the problem below, please note it in a comment rather than an answer. Also, this is for a research replication project, so even though re-running a regression after appending only one new data point might be silly (if the data set is large), I still have to do it. Thanks!

In Pandas, the window_type argument to pandas.ols accepts a rolling option, but it seems implicit that this requires choosing a window size (or falling back to the whole data sample as the default). I'm looking to instead use all the data in a cumulative fashion.

I am trying to run a regression on a pandas.DataFrame that is sorted by date. For each index i, I want to run a regression using all of the data available from the earliest date up through the date at index i. So the window effectively grows by one observation on every iteration, the data is used cumulatively from the earliest observation onward, and nothing is ever dropped out of the window.

I have written a function (below) that works with apply to perform this, but it is unacceptably slow. Instead, is there a way to use pandas.ols to directly perform this sort of cumulative regression?

Here are some more specifics about my data. I have a pandas.DataFrame containing an identifier column, a date column, a column of left-hand-side values, and a column of right-hand-side values. I want to use groupby to group by the identifier and then, for each group, perform a cumulative regression of the left-hand-side variable on the right-hand-side variable at every time period.
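For concreteness, a frame with this layout could be built as follows (a minimal sketch for illustration only; the column names id, date, lhs, and rhs are placeholders, not the actual data):

from datetime import datetime

import numpy as np
import pandas

# Toy frame with the four columns described above (placeholder names).
df = pandas.DataFrame({
    'id':   ['A'] * 3 + ['B'] * 3,
    'date': [datetime(2001, 1, 31), datetime(2001, 2, 28),
             datetime(2001, 3, 30)] * 2,
    'lhs':  np.random.randn(6),
    'rhs':  np.random.randn(6),
})
# Each identifier's rows must be sorted by date before any cumulative
# logic applies; sort_index(by=...) is the 0.8-era idiom for this.
df = df.sort_index(by=['id', 'date'])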

Here is the function I am able to use with apply on the identifier-grouped object:

import numpy as np
import pandas


def cumulative_ols(data_frame,
                   lhs_column,
                   rhs_column,
                   date_column,
                   min_obs=60):

    beta_dict = {}
    for dt in data_frame[date_column].unique():
        # Re-run the regression on all data up to and including this date.
        cur_df = data_frame[data_frame[date_column] <= dt]
        obs_count = cur_df[lhs_column].notnull().sum()

        if min_obs <= obs_count:
            beta = pandas.ols(
                              y=cur_df[lhs_column],
                              x=cur_df[rhs_column],
                             ).beta.ix['x']
            ###
        else:
            beta = np.NaN
        ###
        beta_dict[dt] = beta
    ###

    beta_df = pandas.DataFrame(pandas.Series(beta_dict, name="FactorBeta"))
    beta_df.index.name = date_column
    return beta_df
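With a frame shaped like the sketch above (placeholder column names, and min_obs lowered because the toy sample is tiny), the grouped call looks roughly like this:

betas = df.groupby('id').apply(
    cumulative_ols,
    lhs_column='lhs',
    rhs_column='rhs',
    date_column='date',
    min_obs=2,  # far below the 60 default, purely for the toy data
)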
Asked By: ely


Answers:

Following the advice in the comments, I wrote my own function that can be used with apply and that relies on cumsum to accumulate all of the terms needed to express the coefficient of a univariate OLS regression vectorially.

import numpy as np


def cumulative_ols(
    data_frame,
    lhs_column,
    rhs_column,
    date_column,
    min_obs=60,
):
    """
    Function to perform a cumulative OLS on a Pandas data frame. It is
    meant to be used with `apply` after grouping the data frame by categories
    and sorting by date, so that the regression below applies to the time
    series of a single category's data and the use of `cumsum` will work    
    appropriately given sorted dates. It is also assumed that the date 
    conventions of the left-hand-side and right-hand-side variables have been 
    arranged by the user to match up with any lagging conventions needed.

    This OLS is implicitly univariate and relies on the simplification to the
    formula:

    Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
    Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
    beta     ~ Cov(x,y) / Var(x)
    
    and the code makes a further simplification by cancelling one factor
    of (1/n), so that

    beta ~ (sum(x*y) - (1/n)*sum(x)*sum(y)) / (sum(x^2) - (1/n)*sum(x)^2)

    Notes: one easy improvement is to change the date column to a generic sort
    column since there's no special reason the regressions need to be time-
    series specific.
    """
    data_frame["xy"]         = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
    data_frame["x2"]         = (data_frame[rhs_column]**2).fillna(0.0)
    data_frame["yobs"]       = data_frame[lhs_column].notnull().map(int)
    data_frame["xobs"]       = data_frame[rhs_column].notnull().map(int)
    data_frame["cum_yobs"]   = data_frame["yobs"].cumsum()
    data_frame["cum_xobs"]   = data_frame["xobs"].cumsum()
    data_frame["cumsum_xy"]  = data_frame["xy"].cumsum()
    data_frame["cumsum_x2"]  = data_frame["x2"].cumsum()
    data_frame["cumsum_x"]   = data_frame[rhs_column].fillna(0.0).cumsum()
    data_frame["cumsum_y"]   = data_frame[lhs_column].fillna(0.0).cumsum()
    data_frame["cum_cov"]    = data_frame["cumsum_xy"] - (1.0/data_frame["cum_yobs"])*data_frame["cumsum_x"]*data_frame["cumsum_y"]
    data_frame["cum_x_var"]  = data_frame["cumsum_x2"] - (1.0/data_frame["cum_xobs"])*(data_frame["cumsum_x"])**2
    data_frame["FactorBeta"] = data_frame["cum_cov"]/data_frame["cum_x_var"]
    data_frame["FactorBeta"][data_frame["cum_yobs"] < min_obs] = np.NaN
    return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
### End cumulative_ols
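One caveat worth flagging: the function above writes its scratch columns (xy, x2, the cumulative sums, and so on) directly into the frame it receives, so grouped frames are mutated in place. If that matters, a defensive wrapper (a sketch, not part of the original answer) can copy first:

def cumulative_ols_safe(data_frame, *args, **kwargs):
    # Work on a copy so the caller's frame keeps only its original columns.
    return cumulative_ols(data_frame.copy(), *args, **kwargs)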

I have verified on numerous test cases that this matches the output of my former function and of NumPy's linalg.lstsq. I haven't done a full benchmark of the timing, but anecdotally it is around 50 times faster in the cases I've been working on.
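As a rough illustration of the kind of check described above (a sketch, not the exact test harness, assuming the toy frame from earlier with no missing values), the final cumulative beta for a group should agree with a direct least-squares fit over that group's full history:

import numpy as np

g = df[df['id'] == 'A']  # one identifier's full history
x = g['rhs'].values
y = g['lhs'].values
# Fit y ~ slope*x + intercept; the slope equals Cov(x,y)/Var(x), which is
# exactly what the cumulative formula computes at the final row.
X = np.column_stack([x, np.ones_like(x)])
slope = np.linalg.lstsq(X, y)[0][0]
# `slope` should match the last entry of this group's FactorBeta series.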

Answered By: ely