Pandas mask / where methods versus NumPy np.where
Question:
I often use the Pandas mask and where methods for cleaner logic when updating values in a series conditionally. However, for relatively performance-critical code I notice a significant performance drop relative to numpy.where.
While I’m happy to accept this for specific cases, I’m interested to know:
- Do the Pandas mask / where methods offer any additional functionality, apart from the inplace / errors / try_cast parameters? I understand those three parameters but rarely use them. For example, I have no idea what the level parameter refers to.
- Is there any non-trivial counter-example where mask / where outperforms numpy.where? If such an example exists, it could influence how I choose appropriate methods going forwards.
For reference, here’s some benchmarking on Pandas 0.19.2 / Python 3.6.0:
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
assert (df[0].mask(df[0] > 0.5, 1).values == np.where(df[0] > 0.5, 1, df[0])).all()
%timeit df[0].mask(df[0] > 0.5, 1) # 145 ms per loop
%timeit np.where(df[0] > 0.5, 1, df[0]) # 113 ms per loop
The performance appears to diverge further for non-scalar values:
%timeit df[0].mask(df[0] > 0.5, df[0]*2) # 338 ms per loop
%timeit np.where(df[0] > 0.5, df[0]*2, df[0]) # 153 ms per loop
Answers:
I’m using pandas 0.23.3 and Python 3.6, so I can see a real difference in running time only for your second example.
But let’s investigate a slightly different version of your second example (so we get 2*df[0] out of the way). Here is our baseline on my machine:
twice = df[0]*2
mask = df[0] > 0.5
%timeit np.where(mask, twice, df[0])
# 61.4 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[0].mask(mask, twice)
# 143 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy’s version is about 2.3 times faster than pandas.
So let’s profile both functions to see the difference. Profiling is a good way to get the big picture when one isn’t very familiar with the code base: it is faster than debugging and less error-prone than trying to figure out what’s going on just by reading the code.
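Side note (my own addition, not part of the original analysis): if perf isn’t available, for example on a non-Linux machine, the standard-library profiler gives a much coarser, Python-level view. It won’t resolve the C symbols shown below, but it does show which pandas-internal calls dominate:
import cProfile
# profile one mask() call and sort the report by cumulative time
cProfile.run("df[0].mask(mask, twice)", sort="cumtime")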
I’m on Linux and use perf. For the numpy version we get (for the listing see Appendix A):
>>> perf record python np_where.py
>>> perf report
Overhead Command Shared Object Symbol
68,50% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] PyArray_Where
8,96% python [unknown] [k] 0xffffffff8140290c
1,57% python mtrand.cpython-36m-x86_64-linux-gnu.so [.] rk_random
As we can see, the lion’s share of the time is spent in PyArray_Where – about 69%. The unknown symbol is a kernel function (as a matter of fact, clear_page) – I ran without root privileges, so the symbol is not resolved.
And for pandas we get (see Appendix B for code):
>>> perf record python pd_mask.py
>>> perf report
Overhead Command Shared Object Symbol
37,12% python interpreter.cpython-36m-x86_64-linux-gnu.so [.] vm_engine_iter_task
23,36% python libc-2.23.so [.] __memmove_ssse3_back
19,78% python [unknown] [k] 0xffffffff8140290c
3,32% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_isnan
1,48% python umath.cpython-36m-x86_64-linux-gnu.so [.] BOOL_logical_not
Quite a different situation:
- pandas doesn’t use PyArray_Where under the hood – the most prominent time-consumer is vm_engine_iter_task, which is numexpr functionality (see the quick check after this list).
- there is some heavy memory-copying going on – __memmove_ssse3_back uses about 25% of the time! Probably some of the kernel’s functions are also connected to memory accesses.
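A quick way to confirm that numexpr really is on this code path (my own hedged check, not something measured in the original runs) is to switch pandas’ numexpr usage off via the documented compute.use_numexpr option and re-run the benchmark:
# disable pandas' use of numexpr, so mask() falls back to the plain numpy path
pd.set_option('compute.use_numexpr', False)
%timeit df[0].mask(mask, twice)
pd.set_option('compute.use_numexpr', True)  # restore the default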
Actually, pandas 0.19 used PyArray_Where under the hood; for the older version the perf report would look like:
Overhead Command Shared Object Symbol
32,42% python multiarray.so [.] PyArray_Where
30,25% python libc-2.23.so [.] __memmove_ssse3_back
21,31% python [kernel.kallsyms] [k] clear_page
1,72% python [kernel.kallsyms] [k] __schedule
So back then it basically used np.where under the hood, plus some overhead (above all data copying, see __memmove_ssse3_back).
I see no scenario where pandas could become faster than numpy in pandas version 0.19 – it just adds overhead to numpy’s functionality. Pandas version 0.23.3 is an entirely different story: here the numexpr module is used, and it is quite possible that there are scenarios in which pandas’ version is (at least slightly) faster.
I’m not sure this memory copying is really called for/necessary – maybe one could even call it a performance bug – but I just don’t know enough to be certain.
We could help pandas not to copy by peeling away some indirections (passing np.array instead of pd.Series). For example:
%timeit df[0].mask(mask.values > 0.5, twice.values)
# 75.7 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, pandas is only 25% slower. The perf says:
Overhead Command Shared Object Symbol
50,81% python interpreter.cpython-36m-x86_64-linux-gnu.so [.] vm_engine_iter_task
14,12% python [unknown] [k] 0xffffffff8140290c
9,93% python libc-2.23.so [.] __memmove_ssse3_back
4,61% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_isnan
2,01% python umath.cpython-36m-x86_64-linux-gnu.so [.] BOOL_logical_not
There is much less data copying, but still more than in the numpy version, and this copying is mostly responsible for the remaining overhead.
My key take-aways from this:
- pandas has the potential to be at least slightly faster than numpy (because its numexpr-based implementation could, in principle, beat np.where). However, pandas’ somewhat opaque handling of data copying makes it hard to predict when this potential is overshadowed by (unnecessary) copying.
- when the performance of where / mask is the bottleneck, I would use numba/cython to improve the performance – see my rather naive attempts to use numba and cython further below.
The idea is to take the np.where(df[0] > 0.5, df[0]*2, df[0]) version and to eliminate the need to create a temporary array, i.e. df[0]*2.
As proposed by @max9111, using numba:
import numba as nb

@nb.njit
def nb_where(df):
    n = len(df)
    output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0 * df[i]
        else:
            output[i] = df[i]
    return output
assert (np.where(df[0] > 0.5, twice, df[0]) == nb_where(df[0].values)).all()
%timeit np.where(df[0] > 0.5, df[0]*2, df[0])
# 85.1 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nb_where(df[0].values)
# 17.4 ms ± 673 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which is about a factor of 5 faster than the numpy version!
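(A small usage note of my own: nb_where returns a plain ndarray, so if you need a Series again you can re-wrap it with the original index.)
# hypothetical convenience step, not benchmarked above: re-attach the index
result = pd.Series(nb_where(df[0].values), index=df.index)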
And here is my far less successful attempt to improve the performance with the help of Cython:
%%cython -a
cimport numpy as np
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_where(double[::1] df):
    cdef int i
    cdef int n = len(df)
    cdef np.ndarray[np.float64_t] output = np.empty(n, dtype=np.float64)
    for i in range(n):
        if df[i] > 0.5:
            output[i] = 2.0 * df[i]
        else:
            output[i] = df[i]
    return output
assert (df[0].mask(df[0] > 0.5, 2*df[0]).values == cy_where(df[0].values)).all()
%timeit cy_where(df[0].values)
# 66.7 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This gives about a 25% speed-up over the numpy version. I’m not sure why Cython is so much slower than numba, though.
Listings:
A: np_where.py:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
for _ in range(50):
    np.where(df[0] > 0.5, twice, df[0])
B: pd_mask.py:
import pandas as pd
import numpy as np
np.random.seed(0)
n = 10000000
df = pd.DataFrame(np.random.random(n))
twice = df[0]*2
mask = df[0] > 0.5
for _ in range(50):
    df[0].mask(mask, twice)