Pandas rolling apply using multiple columns

Question:

I am trying to use pandas.DataFrame.rolling.apply() to apply a rolling function to multiple columns.
Python version is 3.7, pandas is 1.0.2.

import pandas as pd

# function to calculate
def masscenter(x):
    print(x)  # for debug purposes
    return 0

#simple DF creation routine
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

'stamp' is monotonic and unique, 'price' is double and contains no NaNs, 'nQty' is integer and also contains no NaNs.

So, I need to calculate rolling ‘center of mass’, i.e. sum(price*nQty)/sum(nQty).
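As a quick sanity check of the formula, here is a minimal sketch that evaluates it by hand for a window holding the first two sample rows. The variable names are illustrative only.

```python
# First two rows of the sample data: (price, nQty)
price = [87.60, 87.51]
qty = [739, 10]

# Center of mass: sum(price * nQty) / sum(nQty)
center = sum(p * q for p, q in zip(price, qty)) / sum(qty)
print(round(center, 6))
```

The result is dominated by the first row because its quantity is much larger.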

What I tried so far:

df.apply(masscenter, axis = 1)

masscenter is called 5 times, each time with a single row, and the output looks like

price     87.6
nQty     739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64

This is the desired input to masscenter, because I can easily access price and nQty using x[0], x[1]. However, I am stuck with rolling.apply().
Reading the docs for
DataFrame.rolling() and rolling.apply()
I supposed that using 'axis' in rolling() and 'raw' in apply() would achieve similar behaviour. A naive approach

rol = df.rolling(window=2)
rol.apply(masscenter)

prints row by row (increasing number of rows up to window size)

stamp
1900-01-01 02:59:47.000282    87.60
1900-01-01 03:00:01.042391    87.51
dtype: float64

then

stamp
1900-01-01 02:59:47.000282    739.0
1900-01-01 03:00:01.042391     10.0
dtype: float64

So, the columns are passed to masscenter separately (expected).

Sadly, the docs say barely anything about 'axis'. The obvious next variant was

rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)

It never calls masscenter and raises a ValueError in rol.apply(..):

> Length of passed values is 1, index implies 5

I admit that I’m not sure about the 'axis' parameter and how it works, due to the lack of documentation. This is the first part of the question:
What is going on here? How do I use ‘axis’ properly? What is it designed for?

Of course, there were answers previously, namely:

How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not Rolling.

How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
The answer suggests writing my own roll function, but the problem for me is the same one raised in the comments: what if one needs an offset window size (e.g. '1T') for non-uniform timestamps?
I don’t like the idea of reinventing the wheel from scratch. I’d also like to use pandas for everything, to prevent inconsistency between sets obtained from pandas and a ‘self-made roll’.
There is another answer to that question, suggesting populating a dataframe separately and calculating whatever I need from it, but that will not work: the size of the stored data would be enormous.
The same idea presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments

Another Q & A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T').

Some of the answers were written before pandas 1.0 came out, and given that the docs could be much better, I hope it is now possible to roll over multiple columns simultaneously.

The second part of the question is:
Is there any possibility to roll over multiple columns simultaneously using pandas 1.0.x with offset window size?

Thank you very much.

Asked By: Suthiro

||

Answers:

You can use the rolling_apply function from the numpy_ext module:

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply


def masscenter(price, nQty):
    return np.sum(price * nQty) / np.sum(nQty)


df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

window = 2
df['y'] = rolling_apply(masscenter, window, df.price.values, df.nQty.values)
print(df)

                            price  nQty          y
stamp                                             
1900-01-01 02:59:47.000282  87.60   739        NaN
1900-01-01 03:00:01.042391  87.51    10  87.598798
1900-01-01 03:00:01.630182  87.51    10  87.510000
1900-01-01 03:00:01.635150  88.00   792  87.993890
1900-01-01 03:00:01.914104  88.00    10  88.000000

Answered By: saninstein

I found no way to roll over two columns with built-in pandas functions, but it can be done without them.
The code is listed below.

import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# function to find an index corresponding
# to current value minus offset value
def prevInd(series, offset, date):
    offset = to_offset(offset)
    end_date = date - offset
    end = series.index.searchsorted(end_date, side="left")
    return end

# function to find an index corresponding
# to the first value greater than current
# it is useful when one has timeseries with non-unique
# but monotonically increasing values
def nextInd(series, date):
    end = series.index.searchsorted(date, side="right")
    return end

def twoColumnsRoll(dFrame, offset, usecols, fn, columnName = 'twoColRol'):
    # find all unique indices
    uniqueIndices = dFrame.index.unique()
    numOfPoints = len(uniqueIndices)
    # prepare an output array
    moving = np.zeros(numOfPoints)
    # nameholders
    price = dFrame[usecols[0]]
    qty   = dFrame[usecols[1]]

    # iterate over unique indices
    for ii in range(numOfPoints):
        # nameholder
        pp = uniqueIndices[ii]
        # right index - value greater than current
        rInd = nextInd(dFrame, pp)
        # left index - the least value that
        # is greater than or equal to (pp - offset)
        lInd = prevInd(dFrame, offset, pp)
        # call the actual calculating function over two arrays
        moving[ii] = fn(price.iloc[lInd:rInd], qty.iloc[lInd:rInd])
    # construct and return DataFrame
    return pd.DataFrame(data=moving, index=uniqueIndices, columns=[columnName])

This code works, but it is relatively slow and inefficient. I suppose one can use numpy.lib.stride_tricks from How to invoke pandas.rolling.apply with parameters from multiple column? to speed things up.
However, go big or go home: I ended up writing a function in C++ and a wrapper for it.
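For illustration, here is a minimal sketch of the stride-tricks idea for the center-of-mass calculation. It assumes NumPy 1.20 or later (for sliding_window_view) and a fixed number of rows per window, so it does not cover offset windows such as '1T'. It is not the C++ routine mentioned above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sample data from the question, as plain arrays
price = np.array([87.60, 87.51, 87.51, 88.00, 88.00])
qty = np.array([739, 10, 10, 792, 10], dtype=float)
window = 2  # fixed row count, an assumption of this sketch

# Each row of p and q is one window (a view, no copying)
p = sliding_window_view(price, window)
q = sliding_window_view(qty, window)
center = (p * q).sum(axis=1) / q.sum(axis=1)

# Pad the leading incomplete windows with NaN so the output aligns with the input
out = np.full(price.shape, np.nan)
out[window - 1:] = center
print(out)
```

Because the windows are views into the original arrays, the whole calculation stays vectorized instead of looping in Python.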

I’d rather not post this as an answer, since it is a workaround and answers neither part of my question, but it is too long for a comment.
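To make the window bounds used by prevInd and nextInd concrete, here is a small sketch of the two searchsorted calls in isolation. The timestamps are made up for illustration; they include a duplicate and a row that sits exactly on the left edge of the window.

```python
import pandas as pd

# Hypothetical, monotonically increasing index with one duplicated timestamp
idx = pd.to_datetime(
    ["03:00:00.5", "03:00:01.5", "03:00:01.5", "03:00:02.2"],
    format="%H:%M:%S.%f",
)
current = idx[1]

# Left bound: first position whose timestamp is >= current - offset
left = idx.searchsorted(current - pd.Timedelta("1s"), side="left")
# Right bound: first position whose timestamp is > current
right = idx.searchsorted(current, side="right")

print(left, right)  # positions [left, right) form the window
```

Note that with side="left" a row lying exactly at current - offset is included in the window, whereas pandas' own offset windows exclude the left edge by default (closed='right'), so results can differ at that boundary.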

Answered By: Suthiro

How about this:

def masscenter(ser):
    print(df.loc[ser.index])
    return 0

rol = df.price.rolling(window=2)
rol.apply(masscenter, raw=False)

It uses the rolling logic to get subsets from an arbitrary column. The raw=False option provides you with index values for those subsets (which are given to you as Series), then you use those index values to get multi-column slices from your original DataFrame.
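As a sketch of how this idea could be applied to the original problem with an offset window, the example below computes the center of mass over a time-based window. The '1s' window is an arbitrary choice for the sample data, and the approach assumes the index is unique, as stated in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [87.60, 87.51, 87.51, 88.00, 88.00],
     "nQty": [739, 10, 10, 792, 10]},
    index=pd.to_datetime(
        ["02:59:47.000282", "03:00:01.042391", "03:00:01.630182",
         "03:00:01.635150", "03:00:01.914104"],
        format="%H:%M:%S.%f",
    ),
)

def masscenter(ser):
    # ser is the 'price' window; its index selects the same rows of df
    window = df.loc[ser.index]
    return (window["price"] * window["nQty"]).sum() / window["nQty"].sum()

# Offset window: every row within the last second, up to and including the current row
result = df["price"].rolling("1s").apply(masscenter, raw=False)
print(result)
```

Each call slices the full DataFrame again, so this is likely to be slow on large inputs, but it keeps pandas in charge of the window boundaries.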

Answered By: adr

With reference to the excellent answer from @saninstein.

Install numpy_ext from: https://pypi.org/project/numpy-ext/

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply as rolling_apply_ext

def box_sum(a,b):
    return np.sum(a) + np.sum(b)

df = pd.DataFrame({"x": [1,2,3,4], "y": [1,2,3,4]})

window = 2
df["sum"] = rolling_apply_ext(box_sum, window , df.x.values, df.y.values)

Output:

print(df.to_string(index=False))
 x  y  sum
 1  1  NaN
 2  2  6.0
 3  3 10.0
 4  4 14.0

Notes

  • The rolling function is timeseries friendly. It defaults to always looking backwards, so the 6 is the sum of present and past values in the array.
  • In the sample above, rolling_apply is imported as rolling_apply_ext so it cannot possibly interfere with any existing calls to Pandas rolling_apply (thanks to a comment by @LudoSchmidt).

As a side note, I gave up trying to use Pandas for this. It’s fundamentally broken: it handles single-column aggregate and apply with few problems, but it’s an overly complex Rube Goldberg machine when trying to get it to work with two or more columns.

Answered By: Contango

How about this?

import numpy as np
import pandas as pd

ggg = pd.DataFrame({"a":[1,2,3,4,5,6,7], "b":[7,6,5,4,3,2,1]})

def my_rolling_apply2(df, fun, window):
    prepend = [None] * (window - 1)
    end = len(df) - window
    mid = map(lambda start: fun(df[start:start + window]), np.arange(0,end))
    last =  fun(df[end:])
    return [*prepend, *mid, last]

my_rolling_apply2(ggg, lambda df: (df["a"].max(), df["b"].min()), 3)

And result is:

[None, None, (3, 5), (4, 4), (5, 3), (6, 2), (7, 1)]

Answered By: Anibal Yeh

To perform a rolling window operation with access to all columns of a dataframe, you can pass method='table' to rolling(). Example:

import pandas as pd
import numpy as np
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 3, 5, 7, 9, 11]})

@jit
def f(w):
    # we have access to both columns of the dataframe here
    return np.max(w), np.min(w)

df.rolling(3, method='table').apply(f, raw=True, engine='numba')

It should be noted that method='table' was added in pandas 1.3 and requires the numba engine (pip install numba). The @jit decorator in the example is not mandatory but helps with performance. The result of the above example code will be:

      a    b
0   NaN  NaN
1   NaN  NaN
2   5.0  1.0
3   7.0  2.0
4   9.0  3.0
5  11.0  4.0

Answered By: Hamid Fadishei