Why do these two different ways to sum a 2d array have such different performance?
Question:
Consider the following two ways of summing all the values in a 2d numpy array.
import numpy as np
from numba import njit
a = np.random.rand(2, 5000)
@njit(fastmath=True, cache=True)
def sum_array_slow(arr):
    s = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            s += arr[i, j]
    return s
@njit(fastmath=True, cache=True)
def sum_array_fast(arr):
    s = 0
    for i in range(arr.shape[1]):
        s += arr[0, i]
    for i in range(arr.shape[1]):
        s += arr[1, i]
    return s
Looking at the nested loop in sum_array_slow, it seems it should perform exactly the same operations in the same order as sum_array_fast. However:
In [46]: %timeit sum_array_slow(a)
7.7 µs ± 374 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [47]: %timeit sum_array_fast(a)
951 ns ± 2.63 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Why is the sum_array_fast function 8 times faster than sum_array_slow when it seems it would be performing the same computations in the same order?
Answers:
This is because the slow version is not automatically vectorized (i.e. the compiler fails to generate fast SIMD code), while the fast version is. This is most likely because Numba fails to optimize away the (negative) index wrapping in the nested-loop version, so it is a missed optimization in Numba.
This can be seen by analysing the assembly code. Here is the hot loop of the slow version:
.LBB0_6:
addq %rbx, %rdx
vaddsd (%rax,%rdx,8), %xmm0, %xmm0
leaq 1(%rsi), %rdx
cmpq $1, %rbp
cmovleq %r13, %rdx
addq %rbx, %rdx
vaddsd (%rax,%rdx,8), %xmm0, %xmm0
leaq 2(%rsi), %rdx
cmpq $2, %rbp
cmovleq %r13, %rdx
addq %rbx, %rdx
vaddsd (%rax,%rdx,8), %xmm0, %xmm0
leaq 3(%rsi), %rdx
cmpq $3, %rbp
cmovleq %r13, %rdx
addq $4, %rsi
leaq -4(%rbp), %rdi
addq %rbx, %rdx
vaddsd (%rax,%rdx,8), %xmm0, %xmm0
cmpq $4, %rbp
movl $0, %edx
cmovgq %rsi, %rdx
movq %rdi, %rbp
cmpq %rsi, %r12
jne .LBB0_6
We can see that Numba produces many useless index checks, which make the loop highly inefficient. I am not aware of any clean way to fix this issue, which is unfortunate since the problem is far from rare in practice. Using a native language like C or C++ avoids it (since there is no index wrapping on arrays there). An unsafe/ugly workaround would be to use raw pointers in Numba, but extracting the NumPy data pointer and handing it to Numba seems quite a pain (if even possible).
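As a middle ground (a sketch of mine, not part of the original answer, assuming a C-contiguous input), iterating over a flattened view removes the 2-D index computation that triggers the wrapping checks in the first place:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # no-op fallback so the sketch also runs without Numba installed
    def njit(*args, **kwargs):
        return lambda f: f

@njit(fastmath=True, cache=True)
def sum_array_flat(arr):
    flat = arr.ravel()  # 1-D contiguous view: a single index, no 2-D wrapping
    s = 0.0
    for i in range(flat.shape[0]):
        s += flat[i]
    return s

a = np.random.rand(2, 5000)
print(abs(sum_array_flat(a) - a.sum()) < 1e-6)
```

Whether this restores vectorization depends on the Numba/LLVM version, and ravel() only stays copy-free for contiguous arrays.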
And here is the fast one:
.LBB0_8:
vaddpd (%r11,%rsi,8), %ymm0, %ymm0
vaddpd 32(%r11,%rsi,8), %ymm1, %ymm1
vaddpd 64(%r11,%rsi,8), %ymm2, %ymm2
vaddpd 96(%r11,%rsi,8), %ymm3, %ymm3
vaddpd 128(%r11,%rsi,8), %ymm0, %ymm0
vaddpd 160(%r11,%rsi,8), %ymm1, %ymm1
vaddpd 192(%r11,%rsi,8), %ymm2, %ymm2
vaddpd 224(%r11,%rsi,8), %ymm3, %ymm3
vaddpd 256(%r11,%rsi,8), %ymm0, %ymm0
vaddpd 288(%r11,%rsi,8), %ymm1, %ymm1
vaddpd 320(%r11,%rsi,8), %ymm2, %ymm2
vaddpd 352(%r11,%rsi,8), %ymm3, %ymm3
vaddpd 384(%r11,%rsi,8), %ymm0, %ymm0
vaddpd 416(%r11,%rsi,8), %ymm1, %ymm1
vaddpd 448(%r11,%rsi,8), %ymm2, %ymm2
vaddpd 480(%r11,%rsi,8), %ymm3, %ymm3
addq $64, %rsi
addq $-4, %rdi
jne .LBB0_8
In this case, the loop is well optimized. In fact, it is nearly optimal for large arrays. For small arrays, like in your example, it is not optimal on some processors like mine. Indeed, AFAIK, the unrolled instructions do not use enough registers to hide the latency of the FMA unit (because LLVM generates sub-optimal code internally). Lower-level native code is likely required to fix this (at least, there is no simple way to fix it in Numba).
Update
Thanks to this link provided by @max9111, the slow code can be optimized by using unsigned integers (which cannot be negative, so no wrapping check is needed). This trick drastically improves the execution time. Here is the modified code:
@njit(fastmath=True, cache=True)
def sum_array_faster(arr):
    s = 0
    for i in range(np.uint64(arr.shape[0])):
        for j in range(np.uint64(arr.shape[1])):
            s += arr[i, j]
    return s
Here is the performance on an Intel Xeon W-2255 processor:
slow: 9.66 µs
faster (unsigned indices): 1.13 µs
fast: 1.14 µs
theoretical optimum: 0.30-0.35 µs
The workaround of replacing opt=0 with opt=2 (thanks to @max9111 again) does not give great results on my machine:
slow: 2.12 µs
faster (unsigned indices): 2.17 µs
fast: 2.09 µs
Not to mention the compilation time is also slightly longer.
A faster implementation can be written so as to better hide the latency of the FMA instructions:
@njit(fastmath=True, cache=True)
def sum_array_fastest(arr):
    s0, s1 = 0, 0
    for i in range(arr.shape[1]):
        s0 += arr[0, i]
        s1 += arr[1, i]
    return s0 + s1
This one takes 1.08 µs, which is better.
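Going one step further, more independent accumulators can be used so more addition chains are in flight at once. The sketch below is mine, not part of the original answer; whether it actually beats the two-accumulator version depends on the CPU and on how much LLVM already unrolls the loop:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # no-op fallback so the sketch also runs without Numba installed
    def njit(*args, **kwargs):
        return lambda f: f

@njit(fastmath=True, cache=True)
def sum_array_4acc(arr):
    # four independent partial sums break the serial add dependency chain
    s0, s1, s2, s3 = 0.0, 0.0, 0.0, 0.0
    n = arr.shape[1]
    i = 0
    while i + 2 <= n:
        s0 += arr[0, i]
        s1 += arr[1, i]
        s2 += arr[0, i + 1]
        s3 += arr[1, i + 1]
        i += 2
    while i < n:  # odd-length remainder
        s0 += arr[0, i]
        s1 += arr[1, i]
        i += 1
    return (s0 + s2) + (s1 + s3)

a = np.random.rand(2, 5000)
print(abs(sum_array_4acc(a) - a.sum()) < 1e-6)
```

Treat this as an experiment to benchmark, not a guaranteed win.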
There are still two limiting factors in the generated Numba code:
- the overhead of calling a Numba function is significant compared to the (short) execution time: 250-300 ns;
- Numba does not make use of the AVX-512 units available on my machine (zmm registers are twice as wide as the AVX ymm registers).
Note that the assembly code can be extracted using the inspect_asm method of Numba functions.
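For reference, here is a minimal way to pull out those listings (a sketch of mine; the function must be called at least once so that a signature gets compiled):

```python
import numpy as np

try:
    from numba import njit
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False  # the inspection below needs Numba itself

asm_listings = {}
if HAVE_NUMBA:
    @njit(fastmath=True)  # cache=True omitted: cache-loaded code may skip codegen
    def sum_array_fast(arr):
        s = 0.0
        for i in range(arr.shape[1]):
            s += arr[0, i]
        for i in range(arr.shape[1]):
            s += arr[1, i]
        return s

    a = np.random.rand(2, 5000)
    sum_array_fast(a)  # call once so one signature is compiled
    # inspect_asm() maps each compiled signature to its assembly text
    asm_listings = sum_array_fast.inspect_asm()
    for sig, asm in asm_listings.items():
        print(sig, "->", len(asm.splitlines()), "assembly lines")
```

Searching the returned text for vaddpd (packed adds) is a quick way to check whether a given loop was vectorized.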