Increase performance of df.rolling(…).apply(…) for large dataframes

Question:

Execution time of this code is too long.

df.rolling(window=255).apply(myFunc)

My dataframes shape is (500, 10000).

                   0         1 ... 9999
2021-11-01  0.011111  0.054242 
2021-11-04  0.025244  0.003653 
2021-11-05  0.524521  0.099521 
2021-11-06  0.054241  0.138321 
...

I make the calculation for each date with the last 255 date values. myFunc looks like:

def myFunc(x):
   coefs = ...
   return np.sqrt(np.sum(x ** 2 * coefs))

I tried to use swifter but performances are the same :

import swifter
df.swifter.rolling(window=255).apply(myFunc)

I also tried with Dask, but I think I didn’t understand it well because the performance are not much better:

import dask.dataframe as dd
ddf = dd.from_pandas(df)
ddf = ddf.rolling(window=255).apply(myFunc, raw=False)
ddf.execute()

I didn’t manage to parallelize the execution with partitions. How can I use dask to improve performance ? I’m on Windows.

Asked By: NCall

||

Answers:

This can be done using numpy+numba pretty efficiently.

Quick MRE:

import numpy as np, pandas as pd, numba

df = pd.DataFrame(
    np.random.random(size=(500, 10000)),
    index=pd.date_range("2021-11-01", freq="D", periods=500)
)

coefs = np.random.random(size=255)

Write the function using pure numpy operations and simple loops, making use of numba.njit(parallel=True) and numba.prange:

@numba.njit(parallel=True)
def numba_func(values, coefficients):
    # define result array: size of original, minus length of
    # coefficients, + 1
    result_tmp = np.zeros(
        shape=(values.shape[0] - len(coefficients) + 1, values.shape[1]),
        dtype=values.dtype,
    )

    result_final = np.empty_like(result_tmp)

    # nested for loops are your friend with numba!
    # (you must unlearn what you have learned)
    for j in numba.prange(values.shape[1]):
        for i in range(values.shape[0] - len(coefficients) + 1):
            for k in range(len(coefficients)):
                result_tmp[i, j] += values[i + k, j] ** 2 * coefficients[k]

        result_final[:, j] = np.sqrt(result_tmp[:, j])

    return result_final

This runs very quickly:

In [5]: %%time
   ...: result = pd.DataFrame(
   ...:     numba_func(df.values, coefs),
   ...:     index=df.index[len(coefs) - 1:],
   ...: )
   ...:
   ...:
CPU times: user 1.69 s, sys: 40.9 ms, total: 1.73 s
Wall time: 844 ms

Note: I’m a huge fan of dask. But the first rule of dask performance is don’t use dask. If it’s small enough to fit comfortably into memory, you’ll usually get the best performance from tuning your pandas or numpy operations and leveraging speedups from cython, numba, etc. And once a problem is big enough to move to dask, these same tuning rules apply to the operations you perform on dask chunks/partitions, too!

Answered By: Michael Delgado

First, since you are using numpy functions, specify the parameter raw=True. Toy example:

import pandas as pd
import numpy as np

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 10000)))

%%time
res = df.rolling(250).apply(foo)

Wall time: 359.3 s

# with raw=True
%%time
res = df.rolling(250).apply(foo, raw=True)

Wall time: 15.2 s

You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 1000)))

# p_apply - is parallel analogue of apply method
%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes')

Wall time: 2.2 s

With engine='numba'

%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes', engine='numba')

Wall time: 1.2 s

Total speedup is 359/1.2 ~ 300!

Answered By: padu
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.