Why is repeated numpy array access faster using a single-element view?

Question:

I saw in another SO thread that it’s possible to create a single-element view of an array arr with arr[index:index+1]. This is useful to me since I need to set several values of a (possibly large, ~100k entries) array repeatedly. But before I just used that approach, I wanted to make sure that the extra work of creating the view doesn’t cost too much time.
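For concreteness, the view-based update pattern in question looks roughly like this (a minimal sketch; the array size and values below are made up):

import numpy as np

arr = np.zeros(100_000)  # stand-in for the large array
index = 42_000           # an index that gets written repeatedly

# Plain indexing: every write resolves arr[index] again.
for k in range(100):
    arr[index] = k

# Single-element view: build the view once, then write through it.
view = arr[index:index + 1]
for k in range(100):
    view[0] = k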
Surprisingly, I’ve found that if you access an index at least ~10 times, you’re already better off using the view.
[Plot: comparison of the time taken to access a numpy array repeatedly via an index vs. a single-element view]

The data for this plot was created by timing the two approaches (in Python 3.10):

#!/bin/python3
# https://gist.github.com/SimonLammer/7f27fd641938b4a8854b55a3851921db

from datetime import datetime, timedelta
import numpy as np
import timeit

np.set_printoptions(linewidth=np.inf, formatter={'float': lambda x: format(x, '1.5E')})

def indexed(arr, indices, num_indices, accesses):
    s = 0
    for index in indices[:num_indices]:
        for _ in range(accesses):
            s += arr[index]
    return s

def viewed(arr, indices, num_indices, accesses):
    s = 0
    for index in indices[:num_indices]:
        v = arr[index:index+1]
        for _ in range(accesses):
            s += v[0]
    return s

N = 11_000 # Setting this higher doesn't seem to have a significant effect
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

options = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946]
for num_indices in options:
    for accesses in options:
        print(f"{num_indices=}, {accesses=}")
        for func in ['indexed', 'viewed']:
            t = np.zeros(5)
            end = datetime.now() + timedelta(seconds=2.5)
            i = 0
            while i < 5 or datetime.now() < end:
                t += timeit.repeat(f'{func}(arr, indices, num_indices, accesses)', number=1, globals=globals())
                i += 1
            t /= i
            print(f"  {func.rjust(7)}:", t, f"({i} runs)")

These observations are very counterintuitive to me.
Why is viewed faster than indexed (for more than 10 accesses per index)?


Edit 1:


Edit 2:

I could replicate Jérôme Richard’s findings. The culprit is the index datatype (python int vs. numpy int):

>>> import timeit
>>> timeit.timeit('arr[i]', setup='import numpy as np; arr = np.random.randint(0, 1000, 1000); i = np.random.randint(0, len(arr), 1)[0]', number=20000000)
1.618339812999693
>>> timeit.timeit('arr[i]', setup='import numpy as np; arr = np.random.randint(0, 1000, 1000); i = np.random.randint(0, len(arr), 1)[0]; i = int(i)', number=20000000)
1.2747555710002416
Asked By: Simon L


Answers:

Update: I can’t replicate the timings of this answer anymore. Maybe I had done something in a setup step that changed these results, or it was just a coincidence.

>>> arr = np.random.randint(0, 1000, 1000)
>>> i = 342
>>> def a3(i): return arr[i]
...
>>> def b3(i): return arr[342]
...
>>> def c3(i): return arr[0]
...
>>> t = timeit.repeat('a3(i)', globals=globals(), number=100000000); print(t, np.mean(t), np.median(t))
[17.449311104006483, 17.405843814995023, 17.91914719599299, 18.123263651999878, 18.04744581299019] 17.789002315996914 17.91914719599299
>>> t = timeit.repeat('b3(i)', globals=globals(), number=100000000); print(t, np.mean(t), np.median(t))
[17.55685576199903, 18.099313585989876, 18.032570399998804, 18.153590378991794, 17.628647994992207] 17.894195624394342 18.032570399998804
>>> t = timeit.repeat('c3(i)', globals=globals(), number=100000000); print(t, np.mean(t), np.median(t))
[17.762766532003297, 17.826293045000057, 17.821444382003392, 17.618322997994255, 17.488862683996558] 17.703537928199513 17.762766532003297

The timing difference seems to be caused by loading a variable vs. loading a constant.

import numpy as np
import dis
arr = np.random.randint(0, 1000, 1000)

def a3(i):
    return arr[i]
def b3(i):
    return arr[342]
def c3(i):
    return arr[0]

The only difference between these functions is the way the array is indexed: with i, 342, or 0.

>>> dis.dis(a3)
  2           0 LOAD_GLOBAL              0 (arr)
              2 LOAD_FAST                0 (i)
              4 BINARY_SUBSCR
              6 RETURN_VALUE
>>> dis.dis(b3)                                                                   
  2           0 LOAD_GLOBAL              0 (arr)
              2 LOAD_CONST               1 (342)
              4 BINARY_SUBSCR
              6 RETURN_VALUE
>>> dis.dis(c3)                                                                   
  2           0 LOAD_GLOBAL              0 (arr)
              2 LOAD_CONST               1 (0)
              4 BINARY_SUBSCR
              6 RETURN_VALUE

The variable index is (~8%) slower than a constant index, and a constant index 0 is (~5%) faster still. Accessing the array at index 0 (c3) is (~13%) faster than the variable index (a3).

>>> t = timeit.repeat('a3(i)', globals=globals(), number=10000000); print(t, np.mean(t), np.median(t))
[1.4897515250049764, 1.507482559987693, 1.5573357169923838, 1.581711255988921, 1.588776800010237] 1.5450115715968422 1.5573357169923838
>>> t = timeit.repeat('b3(i)', globals=globals(), number=10000000); print(t, np.mean(t), np.median(t))
[1.4514476449985523, 1.427873961001751, 1.4268056689907098, 1.4114146630017785, 1.442651974997716] 1.4320387825981016 1.427873961001751
>>> t = timeit.repeat('c3(i)', globals=globals(), number=10000000); print(t, np.mean(t), np.median(t))
[1.357518576012808, 1.3500928360008402, 1.3615708220022498, 1.376022889991873, 1.3813936790102161] 1.3653197606035974 1.3615708220022498

Thanks to u/jtclimb https://www.reddit.com/r/Numpy/comments/wb4p12/comment/ii7q53s/?utm_source=share&utm_medium=web2x&context=3


Edit 1: Using the setup parameter of timeit.repeat refutes this hypothesis.

>>> t=timeit.repeat('arr[i]', setup='import numpy as np; arr = np.random.randint(0,10000,1000000); i = 342', number=10000000); print(np.around(t, 5), np.mean(t), np.median(t))
[0.7697  0.76627 0.77007 0.76424 0.76788] 0.7676320286031114 0.7678760859998874
>>> t=timeit.repeat('arr[0]', setup='import numpy as np; arr = np.random.randint(0,10000,1000000); i = 342', number=10000000); print(np.around(t, 5), np.mean(t), np.median(t))
[0.76836 0.76629 0.76794 0.76619 0.7682 ] 0.7673966443951941 0.7679443680099212
Answered By: Simon L

Since num_indices has no significant impact on the observed performance, we can simplify the problem by discarding this parameter (i.e. setting it to 1). Since only large numbers of accesses matter, we can also simplify the problem by considering only a single large value, for example 10946. The use of index can be simplified as well without affecting the benchmark, and the same applies to the return statement. The simplified problem is now why we get this (reproduced on CPython 3.10.5):

import numpy as np

def indexed(arr, index):
    s = 0
    for _ in range(10946): s += arr[index]

def viewed(arr, index):
    s = 0
    v = arr[index:index+1]
    for _ in range(10946): s += v[0]

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

# mean ± std. dev. of 7 runs, 1000 loops each
%timeit indexed(arr, indices[0])       # 1.24 ms ± 22.3 µs per loop
%timeit viewed(arr, indices[0])        # 0.99 ms ± 4.34 µs per loop

Now, the source of the slowdown is pretty limited: it only has to do with arr[index] versus v[0]. It is also important to note that arr and v are basically of the same type, whereas index and 0 are not. Indeed, index is of type np.int64 while 0 is a PyLong object. The thing is that Numpy item types are much slower than builtin ones. Put shortly, CPython functions are called by Numpy in both cases, but PyLong objects can be handled faster than generic PyNumber-based datatypes (by CPython itself). Numpy's type checks are also ordered, so the type checked first can also be faster. For more information, see the section "Deeper analysis & Discussion" below.
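A quick way to see the type mismatch (a small illustrative check, not part of the original benchmark; the exact scalar type can vary by platform):

import numpy as np

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

index = indices[0]
print(type(index))       # e.g. <class 'numpy.int64'> -- a generic Numpy scalar
print(type(0))           # <class 'int'>              -- a builtin PyLong
print(type(int(index)))  # <class 'int'>              -- after explicit conversion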

To fix the problem, you can just convert the Numpy type to a builtin one:

import numpy as np

def indexed(arr, index):
    s = 0
    nativeIndex = int(index)  # <------------------------------
    for _ in range(10946): s += arr[nativeIndex]

def viewed(arr, index):
    s = 0
    v = arr[index:index+1]
    for _ in range(10946): s += v[0]

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

# mean ± std. dev. of 7 runs, 1000 loops each
%timeit indexed(arr, indices[0])       # 981 µs ± 4.6 µs per loop
%timeit viewed(arr, indices[0])        # 989 µs ± 5.3 µs per loop
# The difference is smaller than the sum of the standard deviations 
# so the gap is clearly not statistically significant anymore.

Besides, note that nearly all the time of the two functions is pure overhead. Numpy is not designed for scalar accesses; it is optimized for vectorized ones. A naive expression like v[0] causes a LOT of work to be done by the processor: the expression needs to be interpreted, a new reference-counted object needs to be allocated, and several (internal) Numpy functions need to be called, each with its own overhead (wrapping, dynamic type checking, internal iterator configuration). Low-level profilers report dozens of function calls and 250-300 wasted cycles for something that should take no more than 1 cycle of throughput on a modern x86 processor.
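To illustrate the point about vectorization, here is a sketch (not part of the original benchmark) of how the toy sum could be expressed with a single fancy-indexing call instead of per-element accesses:

import numpy as np

N = 11_000
arr = np.random.randint(0, N, N)
indices = np.random.randint(0, N, N)

def summed_scalar(arr, indices, accesses):
    # One Python-level Numpy call per element: large constant overhead each time.
    s = 0
    for index in indices:
        for _ in range(accesses):
            s += arr[index]
    return s

def summed_vectorized(arr, indices, accesses):
    # Fancy indexing gathers all the elements in one call; the Python loops disappear.
    return int(arr[indices].sum()) * accesses

assert summed_scalar(arr, indices[:100], 3) == summed_vectorized(arr, indices[:100], 3)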


Deeper analysis & Discussion

When the expression arr[nativeIndex] (or v[0]) is evaluated, the CPython interpreter ends up calling Numpy functions, in particular the generic array_subscript. This function calls prepare_index, which performs some type checks on the operand. This function is slower with non-builtin types. One reason is that PyArray_PyIntAsIntp calls CPython type-checking functions, and the code path checking for the PyLong type is faster than the much more generic alternative that "covers everything" (including np.int64, to quote a comment in the code). Additionally, the checks form a cascade, and types that break out of it early tend to result in a faster checking stage (because of the cumulative cost of all the CPython function calls). Because of that, Numpy developers tend to put the most common cases first.

That being said, it is not quite that simple, because the cost of checking each type is not the same (so putting a common but slow-to-check type first can slow down all the other cases) and there are sometimes dependencies in the code. Also note that the CPython interface prevents Numpy from applying further optimizations (which CPython does apply to builtins), which also makes the more generic types a bit slower. In fact, this is part of what makes builtin functions faster (though it is generally not the only reason).
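As a rough way to observe this at the Python level, one can time the same scalar access with indices of different types (a small probe, not from the original answer; the exact numbers depend on the Numpy and CPython versions):

import timeit
import numpy as np

arr = np.random.randint(0, 1000, 1000)

# The same logical index wrapped in different scalar types.
candidates = {
    'builtin int': 342,
    'np.intp':     np.intp(342),
    'np.int64':    np.int64(342),
    'np.uint64':   np.uint64(342),
}

for name, idx in candidates.items():
    t = timeit.timeit('arr[idx]', globals={'arr': arr, 'idx': idx}, number=2_000_000)
    print(f'{name:>11}: {t:.3f} s')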


Related post: Why is np.sum(range(N)) very slow?

Answered By: Jérôme Richard