Why do these two different ways to sum a 2d array have such different performance?

Question:

Consider the following two ways of summing all the values in a 2d numpy array.

import numpy as np
from numba import njit
a = np.random.rand(2, 5000)

@njit(fastmath=True, cache=True)
def sum_array_slow(arr):
    s = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            s += arr[i, j]
    return s
    
@njit(fastmath=True, cache=True)
def sum_array_fast(arr):
    s = 0
    for i in range(arr.shape[1]):
        s += arr[0, i]
    for i in range(arr.shape[1]):
        s += arr[1, i]
    return s

Looking at the nested loop in sum_array_slow, it seems it should be performing exactly the same operations in the same order as sum_array_fast. However:

In [46]: %timeit sum_array_slow(a)
7.7 µs ± 374 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [47]: %timeit sum_array_fast(a)
951 ns ± 2.63 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Why is the sum_array_fast function 8 times faster than sum_array_slow when it seems it would be performing the same computations in the same order?

Asked By: Simd


Answers:

This is because the slow version is not automatically vectorized (i.e. the compiler fails to generate fast SIMD code), while the fast version is. This is almost certainly because Numba fails to optimize away the index wrapping (the support for negative indices) in the nested-loop version, so it is a missed optimization in Numba.

This can be seen by analysing the assembly code. Here is the hot loop of the slow version:

.LBB0_6:
    addq    %rbx, %rdx
    vaddsd  (%rax,%rdx,8), %xmm0, %xmm0
    leaq    1(%rsi), %rdx
    cmpq    $1, %rbp
    cmovleq %r13, %rdx
    addq    %rbx, %rdx
    vaddsd  (%rax,%rdx,8), %xmm0, %xmm0
    leaq    2(%rsi), %rdx
    cmpq    $2, %rbp
    cmovleq %r13, %rdx
    addq    %rbx, %rdx
    vaddsd  (%rax,%rdx,8), %xmm0, %xmm0
    leaq    3(%rsi), %rdx
    cmpq    $3, %rbp
    cmovleq %r13, %rdx
    addq    $4, %rsi
    leaq    -4(%rbp), %rdi
    addq    %rbx, %rdx
    vaddsd  (%rax,%rdx,8), %xmm0, %xmm0
    cmpq    $4, %rbp
    movl    $0, %edx
    cmovgq  %rsi, %rdx
    movq    %rdi, %rbp
    cmpq    %rsi, %r12
    jne .LBB0_6

We can see that Numba produces many useless index checks which make the loop highly inefficient. I am not aware of any clean way to fix this issue, which is unfortunate since it is far from rare in practice. Using a native language like C or C++ avoids the problem (since there is no index wrapping for arrays there). An unsafe/ugly workaround would be to use pointers in Numba, but extracting the NumPy data pointer and handing it to Numba seems quite a pain (if even possible).

And here is the fast one:

.LBB0_8:
    vaddpd  (%r11,%rsi,8), %ymm0, %ymm0
    vaddpd  32(%r11,%rsi,8), %ymm1, %ymm1
    vaddpd  64(%r11,%rsi,8), %ymm2, %ymm2
    vaddpd  96(%r11,%rsi,8), %ymm3, %ymm3
    vaddpd  128(%r11,%rsi,8), %ymm0, %ymm0
    vaddpd  160(%r11,%rsi,8), %ymm1, %ymm1
    vaddpd  192(%r11,%rsi,8), %ymm2, %ymm2
    vaddpd  224(%r11,%rsi,8), %ymm3, %ymm3
    vaddpd  256(%r11,%rsi,8), %ymm0, %ymm0
    vaddpd  288(%r11,%rsi,8), %ymm1, %ymm1
    vaddpd  320(%r11,%rsi,8), %ymm2, %ymm2
    vaddpd  352(%r11,%rsi,8), %ymm3, %ymm3
    vaddpd  384(%r11,%rsi,8), %ymm0, %ymm0
    vaddpd  416(%r11,%rsi,8), %ymm1, %ymm1
    vaddpd  448(%r11,%rsi,8), %ymm2, %ymm2
    vaddpd  480(%r11,%rsi,8), %ymm3, %ymm3
    addq    $64, %rsi
    addq    $-4, %rdi
    jne .LBB0_8

In this case, the loop is well optimized. In fact, it is nearly optimal for large arrays. For small arrays, like in your example, it is not optimal on some processors like mine. Indeed, AFAIK, the unrolled instructions do not use enough registers to hide the latency of the FMA unit (because LLVM generates sub-optimal code internally). Lower-level native code is likely required to fix this (at least, there is no simple way to fix it in Numba).


Update

Thanks to this link provided by @max9111, the slow code can be optimized by using unsigned integers: with unsigned indices Numba does not have to handle negative indices (the wrapping visible in the assembly above), so LLVM can vectorize the loop. This trick drastically improves the execution time. Here is the modified code:

@njit(fastmath=True, cache=True)
def sum_array_faster(arr):
    s = 0
    for i in range(np.uint64(arr.shape[0])):
        for j in range(np.uint64(arr.shape[1])):
            s += arr[i, j]
    return s
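
As a quick sanity check (my addition, not part of the original answer), the result can be compared against NumPy's own sum. Since fastmath reorders the additions, the comparison should use a tolerance rather than exact equality:

a = np.random.rand(2, 5000)
assert np.isclose(sum_array_faster(a), a.sum())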

Here is the performance on an Intel Xeon W-2255 processor:

slow (sum_array_slow):      9.66 µs
faster (sum_array_faster):  1.13 µs
fast (sum_array_fast):      1.14 µs

Theoretical optimum:        0.30-0.35 µs

The workaround of replacing opt=0 with opt=2 (thanks to @max9111 again) does not give great results on my machine:

slow (sum_array_slow):      2.12 µs
faster (sum_array_faster):  2.17 µs
fast (sum_array_fast):      2.09 µs

Not to mention the compilation time is also slightly longer.
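
As a side note (my addition, not taken from the linked workaround): the global LLVM optimization level used by Numba can also be changed through the documented NUMBA_OPT environment variable, which has to be set before Numba is imported. This is a different knob than the internal opt=0 pass mentioned above, so it is not guaranteed to have the same effect:

import os

# Assumption: NUMBA_OPT is read when Numba is imported, so set it first.
# Values 0-3 are passed to LLVM as the optimization level (default is 3).
os.environ["NUMBA_OPT"] = "2"

import numpy as np
from numba import njit  # imported only after setting the variable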

A faster implementation can be written so as to better hide the latency of the FMA instruction:

@njit(fastmath=True, cache=True)
def sum_array_fastest(arr):
    s0, s1 = 0, 0
    for i in range(arr.shape[1]):
        s0 += arr[0, i]
        s1 += arr[1, i]
    return s0 + s1

This one takes 1.08 µs, which is better.
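
For completeness, here is a minimal standalone benchmark sketch (my own harness using timeit rather than IPython's %timeit; it assumes the four functions above are defined in the same session, and absolute timings will of course vary between machines):

import timeit

import numpy as np

a = np.random.rand(2, 5000)
funcs = (sum_array_slow, sum_array_fast, sum_array_faster, sum_array_fastest)

for f in funcs:
    f(a)  # warm-up call so that compilation time is not measured

for f in funcs:
    # Best of 7 repeats of 10,000 calls each, reported per call in µs.
    t = min(timeit.repeat(lambda: f(a), number=10_000, repeat=7)) / 10_000
    print(f"{f.py_func.__name__:18s}: {t * 1e6:.2f} µs")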

There are still two limiting factors in the generated Numba code:

  • the overhead of calling a Numba function is significant compared to the (short) execution time: 250-300 ns;
  • Numba does not make use of the AVX-512 units available on my machine (zmm registers are twice as wide as the AVX ymm ones).

Note that the assembly code can be extracted using the inspect_asm method of Numba-compiled functions.
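
For example (a hypothetical inspection snippet, not from the original answer), one can compile the function for the (2, 5000) float64 input and then search the assembly for packed additions to see how the loop was vectorized:

import re

sum_array_fast(a)  # trigger compilation for this input type
asm = sum_array_fast.inspect_asm(sum_array_fast.signatures[0])

# vaddpd on ymm registers means 256-bit AVX code; zmm would mean AVX-512;
# vaddsd indicates scalar (non-vectorized) additions.
print("vaddpd ymm:", len(re.findall(r"vaddpd.+ymm", asm)))
print("vaddpd zmm:", len(re.findall(r"vaddpd.+zmm", asm)))
print("vaddsd    :", len(re.findall(r"vaddsd", asm)))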

Answered By: Jérôme Richard