Adding single integer to numpy array faster if single integer has python-native int type

Question:

I add a single integer to an array of 1000 integers. This is about 25% faster when I first cast the single integer from numpy.int64 to a Python-native int.

Why? Should I, as a general rule of thumb, convert the single number to a native Python type for single-number-to-array operations with arrays of about this size?

Note: may be related to my previous question Conjugating a complex number much faster if number has python-native complex type.

import numpy as np

nnu = 10418
nnu_use = 5210
a = np.random.randint(nnu,size=1000)
b = np.random.randint(nnu_use,size=1)[0]

%timeit a + b                            # --> 3.9 µs ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a + int(b)                       # --> 2.87 µs ± 8.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note that the speed-up can be enormous (a factor of 50) for scalar-to-scalar operations as well, as seen below:

np.random.seed(100)

a = (np.random.rand(1))[0]
a_native = float(a)
b = complex(np.random.rand(1)+1j*np.random.rand(1))
c = (np.random.rand(1)+1j*np.random.rand(1))[0]
c_native = complex(c)

%timeit a * (b - b.conjugate() * c)                # 6.48 µs ± 49.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a_native * (b - b.conjugate() * c_native)  # 283 ns ± 7.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a * b                                      # 5.07 µs ± 17.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a_native * b                               # 94.5 ns ± 0.868 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Update: Could it be that the latest numpy release fixes the speed difference? The release notes of numpy 1.23 mention that scalar operations are now much faster; see https://numpy.org/devdocs/release/1.23.0-notes.html#performance-improvements-and-changes and https://github.com/numpy/numpy/pull/21188. I am using Python 3.7.6 and numpy 1.21.2.

Asked By: bproxauf


Answers:

On my Windows PC with CPython 3.8.1, I get:

[Old] Numpy 1.22.4:
 - First test: 1.65 µs VS 1.43 µs
 - Second:     2.03 µs VS 0.17 µs

[New] Numpy 1.23.1:
 - First test: 1.38 µs VS 1.24 µs    <----  A bit better than Numpy 1.22.4
 - Second:     0.38 µs VS 0.17 µs    <----  Much better than Numpy 1.22.4

While the new version of Numpy gives a good boost, native types should always be faster than Numpy ones with the (default) CPython interpreter. Indeed, the interpreter needs to call the C functions of Numpy, which is not needed with native types. Additionally, Numpy's checks and wrapping are not optimal, but Numpy is not designed for fast scalar computation in the first place (though the overhead was previously unreasonable). In fact, scalar computations are very inefficient and the interpreter prevents any fast execution.
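
As a rule of thumb, if a scalar taken from a Numpy array is reused in scalar-heavy Python code, it can be converted to a native type once, up front. A minimal sketch (the variable names are illustrative, not from the question); .item() on a Numpy scalar returns the equivalent native Python scalar, just like int(b) does for an integer:

import numpy as np

a = np.random.randint(10418, size=1000)
b = np.random.randint(5210, size=1)[0]   # numpy.int64 scalar

# Convert once, outside any hot code: .item() returns the native Python type
# (int here; float/complex for other dtypes), equivalent to int(b) in this case.
b_native = b.item()

result = a + b_native                    # takes the faster native-scalar path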

If you plan to do many scalar operations, you need to use natively compiled code, possibly using Cython, Numba, or even a raw C/C++ module. Note that Cython does not optimize/inline Numpy calls, but it can operate on native types faster. Native code can certainly do this in one or even two orders of magnitude less time.
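
For example, a minimal Numba sketch (assuming Numba is installed; the function and loop are illustrative, not from the question) where the scalar work happens in compiled code instead of the interpreter:

import numpy as np
from numba import njit

@njit
def scale_sum(a, b):
    # Inside the compiled function the scalars are native machine types,
    # so there is no per-operation CPython or Numpy-scalar overhead.
    s = 0.0
    for x in a:
        s += x * b
    return s

a = np.random.rand(1000)
print(scale_sum(a, 2.5))   # the first call compiles; later calls are fast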

Note that in the first case, the code path taken in the Numpy functions is not the same, and Numpy performs additional checks that are a bit more expensive when the value is not a CPython object. Still, it should be a constant overhead (and it is now relatively small). Otherwise, it would be a bug (and should be reported).
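
To check that the overhead is indeed constant per call rather than proportional to the array size, one can rerun the comparison on a much larger array; a sketch with the standard timeit module (timings will of course vary by machine):

import timeit
import numpy as np

a = np.random.randint(10418, size=1_000_000)   # 1000x larger array
b = np.random.randint(5210, size=1)[0]         # numpy.int64 scalar
b_native = int(b)

# With 1,000,000 elements the element-wise addition dominates, so the
# constant per-call overhead of the Numpy-scalar path should become negligible.
print(timeit.timeit(lambda: a + b, number=200))
print(timeit.timeit(lambda: a + b_native, number=200))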

Related: Why is np.sum(range(N)) very slow?

Answered By: Jérôme Richard