Increase performance of df.rolling(…).apply(…) for large dataframes
Question:
Execution time of this code is too long.
df.rolling(window=255).apply(myFunc)
My dataframe’s shape is (500, 10000).
0 1 ... 9999
2021-11-01 0.011111 0.054242
2021-11-04 0.025244 0.003653
2021-11-05 0.524521 0.099521
2021-11-06 0.054241 0.138321
...
I make the calculation for each date with the last 255 date values. myFunc looks like:
def myFunc(x):
    coefs = ...
    return np.sqrt(np.sum(x ** 2 * coefs))
I tried using swifter, but the performance is the same:
import swifter
df.swifter.rolling(window=255).apply(myFunc)
I also tried Dask, but I think I didn’t understand it well, because the performance is not much better:
import dask.dataframe as dd
ddf = dd.from_pandas(df)
ddf = ddf.rolling(window=255).apply(myFunc, raw=False)
ddf.compute()  # Dask collections are evaluated with .compute()
I didn’t manage to parallelize the execution with partitions. How can I use Dask to improve performance? I’m on Windows.
Answers:
This can be done pretty efficiently using numpy + numba.
Quick MRE:
import numpy as np, pandas as pd, numba
df = pd.DataFrame(
    np.random.random(size=(500, 10000)),
    index=pd.date_range("2021-11-01", freq="D", periods=500)
)
coefs = np.random.random(size=255)
Write the function using pure numpy operations and simple loops, making use of numba.njit(parallel=True) and numba.prange:
@numba.njit(parallel=True)
def numba_func(values, coefficients):
    # define result array: size of original, minus length of
    # coefficients, + 1
    result_tmp = np.zeros(
        shape=(values.shape[0] - len(coefficients) + 1, values.shape[1]),
        dtype=values.dtype,
    )
    result_final = np.empty_like(result_tmp)
    # nested for loops are your friend with numba!
    # (you must unlearn what you have learned)
    for j in numba.prange(values.shape[1]):
        for i in range(values.shape[0] - len(coefficients) + 1):
            for k in range(len(coefficients)):
                result_tmp[i, j] += values[i + k, j] ** 2 * coefficients[k]
        result_final[:, j] = np.sqrt(result_tmp[:, j])
    return result_final
This runs very quickly:
In [5]: %%time
   ...: result = pd.DataFrame(
   ...:     numba_func(df.values, coefs),
   ...:     index=df.index[len(coefs) - 1:],
   ...: )
   ...:
CPU times: user 1.69 s, sys: 40.9 ms, total: 1.73 s
Wall time: 844 ms
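If numba is not available, the same quantity can also be written with plain numpy, which is handy as a cross-check on small inputs. The following is only a sketch: the function name rolling_weighted_rss is made up for illustration, it assumes numpy >= 1.20 (for sliding_window_view), and it has not been benchmarked against the numba version. On the full (500, 10000) frame the contraction may need a lot of memory or run slower than the loops above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_weighted_rss(values, coefficients):
    """Weighted rolling root-sum-of-squares down each column (axis 0)."""
    squared = values ** 2
    # Read-only view of shape (n_rows - window + 1, n_cols, window);
    # the window axis is appended last.
    windows = sliding_window_view(squared, len(coefficients), axis=0)
    # Contract the window axis against the coefficients.
    return np.sqrt(windows @ coefficients)


rng = np.random.default_rng(0)
values = rng.random((20, 4))
coefs = rng.random(5)

fast = rolling_weighted_rss(values, coefs)

# Brute-force reference: one window at a time, same formula as myFunc.
slow = np.array([
    [np.sqrt(np.sum(values[i:i + 5, j] ** 2 * coefs)) for j in range(4)]
    for i in range(20 - 5 + 1)
])
```

As with numba_func, row i of the output corresponds to the window that ends at input row i + len(coefficients) - 1, so the same index=df.index[len(coefs) - 1:] slice applies when wrapping the result in a DataFrame.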
Note: I’m a huge fan of dask. But the first rule of dask performance is don’t use dask. If it’s small enough to fit comfortably into memory, you’ll usually get the best performance from tuning your pandas or numpy operations and leveraging speedups from cython, numba, etc. And once a problem is big enough to move to dask, these same tuning rules apply to the operations you perform on dask chunks/partitions, too!
First, since you are using numpy functions, specify the parameter raw=True. Toy example:
import pandas as pd
import numpy as np
def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))
df = pd.DataFrame(np.random.random((500, 10000)))
%%time
res = df.rolling(250).apply(foo)
Wall time: 359.3 s
# with raw=True
%%time
res = df.rolling(250).apply(foo, raw=True)
Wall time: 15.2 s
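To see where the difference comes from: with raw=False pandas wraps every window in a Series before calling your function, while with raw=True it passes a plain ndarray. A small sketch that records what the function actually receives (make_probe, seen_types and small are names invented for this illustration):

```python
import numpy as np
import pandas as pd

# Collect the type names of the objects handed to the function.
seen_types = {"raw=False": set(), "raw=True": set()}


def make_probe(label):
    def probe(x):
        seen_types[label].add(type(x).__name__)
        return np.sqrt(np.sum(x ** 2 * 2))
    return probe


small = pd.DataFrame(np.arange(12.0).reshape(6, 2))

res_series = small.rolling(3).apply(make_probe("raw=False"), raw=False)
res_array = small.rolling(3).apply(make_probe("raw=True"), raw=True)

print(seen_types)
```

The two results hold the same numbers; only the per-window object differs. Building a Series for every window of every column is presumably where most of the extra time goes on a frame with 10000 columns.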
You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas
# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)
def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))
df = pd.DataFrame(np.random.random((500, 1000)))
# p_apply is the parallel analogue of the apply method
%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes')
Wall time: 2.2 s
With engine='numba':
%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes', engine='numba')
Wall time: 1.2 s
Total speedup is 359/1.2 ≈ 300!