Performance penalty when overriding Numpy's __array_function__() method

Question:

I wrote an array-like class ‘Vector’ which behaves like an ‘np.ndarray’ but has a few extra attributes and methods to be used in a geometry engine (which are omitted here).

The minimal example below overrides ‘__array_function__()’ to ensure that a Vector object is returned when using the np.dot function.

When I benchmarked my code against plain-vanilla np.array objects, I noticed a severe performance hit:

import numpy as np
from timeit import timeit


class Vector(np.ndarray):

    def __new__(cls, input_array):
        return np.array(input_array).view(cls)

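    def __array_function__(self, func, types, args, kwargs):
        if func is np.dot:
            # Compute on plain ndarrays, then re-wrap the result as a Vector.
            out = np.dot(np.asarray(args[0]), np.asarray(args[1]))
            return out.view(Vector)
        # Defer to ndarray's default for every other function; without this,
        # any other Numpy function called on a Vector would return None.
        return super().__array_function__(func, types, args, kwargs)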

Benchmark:

v = Vector([1, 1, 1])
I = np.identity(3)

print(type(np.dot(I, v)))  # Make sure it returns the correct type.

# Create a np.array and Vector object.
A = np.random.random((100, 3))
V = A.view(Vector)

# Compare np.dot speed.
print(timeit(lambda: np.dot(I, A.T)))
print(timeit(lambda: np.dot(I, V.T)))

The above code outputs:

<class '__main__.Vector'>
1.207045791001292
2.063941927997803

This indicates a 70% performance hit. Is this expected? Am I doing something wrong? Is there a way around it? (I’m only interested in np.dot and np.cross.)

If not, I’m afraid I’ll have to abandon my custom classes.

Asked By: CyrielN

||

Answers:

This is expected: the target arrays are very small, so the overhead of calling a pure-Python function is large compared to the time np.dot spends computing on a basic array.


Indeed, np.dot(I, A.T) takes only a few microseconds: 1.7 µs on my machine. A significant part of that time is lost in Numpy overhead, and the actual computation accounts for only a fraction of it. np.dot(I, V.T) additionally has to call the pure-Python function __array_function__, which takes about 1.2 µs. The overall runtime is thus 2.9 µs, which is about 70% slower.
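The arithmetic behind these figures can be checked with a short sketch. It takes the per-call timings quoted above as given (they are machine-dependent) and relies on timeit running the statement 1,000,000 times by default, so the totals printed in the question, in seconds, can be read directly as microseconds per call. The variable names are only illustrative.

```python
# Per-call timings quoted in the answer (microseconds, machine-dependent).
base_us = 1.7       # np.dot(I, A.T) on a plain ndarray
override_us = 1.2   # extra cost attributed to __array_function__

total_us = base_us + override_us
slowdown_pct = (total_us / base_us - 1) * 100
print(f"answer:   {total_us:.1f} us per call, {slowdown_pct:.1f}% slower")

# Totals printed in the question. timeit defaults to number=1_000_000,
# so a total in seconds equals microseconds per call.
question_plain_s = 1.207045791001292
question_vector_s = 2.063941927997803
question_pct = (question_vector_s / question_plain_s - 1) * 100
print(f"question: {question_pct:.1f}% slower")
```

Both sets of numbers land at roughly the same relative slowdown even though the absolute timings differ between the two machines.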

__array_function__ is a bit slow because it is interpreted (assuming you use the standard CPython interpreter), while the usual Numpy functions are written in C and compiled to native code. Interpreted code is significantly slower (almost no optimizations, dynamic typing, many dynamic allocations, object wrapping, etc.). On top of that, the two calls to np.asarray take significant additional time compared to just calling np.dot directly.
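A rough way to see where the time goes is to time each piece separately. The sketch below reuses the Vector class from the question, with a fallback to the default ndarray behaviour for functions other than np.dot (that fallback is an addition made here). per_call_us is a hypothetical helper, and the numbers it prints will vary with the machine and the Numpy version, so treat them as indicative only.

```python
import numpy as np
from timeit import timeit


class Vector(np.ndarray):
    """Minimal ndarray subclass, as in the question."""

    def __new__(cls, input_array):
        return np.array(input_array).view(cls)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.dot:
            out = np.dot(np.asarray(args[0]), np.asarray(args[1]))
            return out.view(Vector)
        # Assumption: defer to ndarray's default for everything else.
        return super().__array_function__(func, types, args, kwargs)


I = np.identity(3)
A = np.random.random((100, 3))
V = A.view(Vector)
AT, VT = A.T, V.T  # transpose once so that only the call of interest is timed

N_CALLS = 20_000


def per_call_us(stmt):
    """Hypothetical helper: average time of one call, in microseconds."""
    return timeit(stmt, number=N_CALLS) / N_CALLS * 1e6


t_plain = per_call_us(lambda: np.dot(I, AT))
t_vector = per_call_us(lambda: np.dot(I, VT))
t_asarray = per_call_us(lambda: (np.asarray(I), np.asarray(VT)))
t_view = per_call_us(lambda: AT.view(Vector))

print(f"np.dot on ndarray      : {t_plain:.2f} us")
print(f"np.dot on Vector       : {t_vector:.2f} us")
print(f"two np.asarray calls   : {t_asarray:.2f} us")
print(f"one .view(Vector) call : {t_view:.2f} us")
```

The lambda wrappers add a small constant cost to every measurement, so the differences between the lines are more meaningful than the absolute values.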

One solution to reduce the overhead is to use Cython. Cython can compile a pure-Python function to native code, and the compiled code can be much faster if type annotations are present. That being said, the benefit of using Cython here is limited. Indeed, half the overhead comes from the Numpy internals when calling np.dot. This is certainly because Numpy has to create Python objects (e.g. args) in order to pass them to the pure-Python function, and also because Numpy and CPython have to perform a few checks (e.g. that the function __array_function__ is actually valid). AFAIK, there is not much you can do about this Numpy overhead.

In the end, since >75% of the execution time of np.dot(I, A.T) is already overhead, it is certainly better to rewrite the code calling this expression so that the calls are vectorized. Indeed, calling __array_function__ once is not really a problem; calling it many times on tiny arrays is. This means you may need to write a class to manage many vectors. If the vectors are of different sizes, then the overhead can still be significant (Numpy is not great for that).
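As an illustration of that last suggestion, here is a minimal sketch of a container that stores many 3D vectors in a single (N, 3) array, so that one np.dot or np.cross call processes all of them. VectorBatch, transform and cross are hypothetical names, and wrapping an array instead of subclassing np.ndarray is an assumption made for the example, not something the answer prescribes.

```python
import numpy as np


class VectorBatch:
    """Hypothetical container holding N 3D vectors as one (N, 3) array."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float).reshape(-1, 3)

    def __len__(self):
        return len(self.data)

    def transform(self, matrix):
        # A single np.dot call applies the 3x3 matrix to every vector:
        # row i of the result is matrix @ self.data[i].
        return VectorBatch(np.dot(self.data, np.asarray(matrix).T))

    def cross(self, other):
        # Row-wise cross product of two batches of equal length.
        return VectorBatch(np.cross(self.data, other.data))


rng = np.random.default_rng(0)
points = rng.random((1000, 3))

# Rotation by 90 degrees about the z axis.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])

batch = VectorBatch(points)
rotated = batch.transform(R)        # one Numpy call for 1000 vectors
normals = batch.cross(rotated)      # one Numpy call for 1000 cross products

# Reference result: one np.dot call per vector, paying the overhead each time.
looped = np.array([np.dot(R, p) for p in points])
print(np.allclose(rotated.data, looped))
```

With this layout the fixed per-call cost is paid once per batch instead of once per vector, which is the point of vectorizing the calls.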

Answered By: Jérôme Richard