How is Numba faster than NumPy for matrix multiplication with integers?

Question:

I was comparing parallel matrix multiplication with numba and matrix multiplication with numpy when I noticed that numpy isn’t as fast with integers (int32).

import numpy as np
from numba import njit, prange

@njit()
def matrix_multiplication(A, B):
  m, n = A.shape
  _, p = B.shape
  C = np.zeros((m, p))
  for i in range(m):
    for j in range(n):
      for k in range(p):
        C[i, k] += A[i, j] * B[j, k]
  return C

@njit(parallel=True, fastmath=True)
def matrix_multiplication_parallel(A, B):
  m, n = A.shape
  _, p = B.shape
  C = np.zeros((m, p))
  for i in prange(m):
    for j in range(n):
      for k in range(p):
        C[i, k] += A[i, j] * B[j, k]
  return C

m = 100
n = 1000
p = 1500
A = np.random.randn(m, n)
B = np.random.randn(n, p)
A2 = np.random.randint(1, 100, size=(m, n))
B2 = np.random.randint(1, 100, size=(n, p))
A3 = np.ones((m, n))
B3 = np.ones((n, p))

# compile function
matrix_multiplication(A, B)
matrix_multiplication_parallel(A, B)

print('normal')
%timeit matrix_multiplication(A, B)
%timeit matrix_multiplication(A2, B2)
%timeit matrix_multiplication(A3, B3)
print('parallel')
%timeit matrix_multiplication_parallel(A, B)
%timeit matrix_multiplication_parallel(A2, B2)
%timeit matrix_multiplication_parallel(A3, B3)
print('numpy')
%timeit A @ B
%timeit A2 @ B2
%timeit A3 @ B3
normal
1.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.56 s* ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.5 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
333 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
313 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
31.2 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.99 s* ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.4 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
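
One caveat about these timings: the warm-up calls only compile the float64 specialization, so the first timed call with the integer arrays also pays JIT compilation cost. A minimal sketch of warming up the integer specializations too, reusing A2 and B2 from the script above:

# Numba compiles one machine-code version per argument type, so the
# integer arrays trigger a separate compilation on their first call.
matrix_multiplication(A2, B2)
matrix_multiplication_parallel(A2, B2)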

I found this answer explaining that numpy doesn’t use BLAS for integers.
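
If raw speed on integer inputs is the goal, one workaround is to route the product through BLAS by casting to float64 first. This sketch assumes the products fit exactly in a float64, which they do here (entries below 100 and n = 1000 keep every sum far under 2**53):

# Workaround sketch: NumPy has no BLAS path for integer dtypes, but the
# float64 product is exact for these magnitudes, so we can cast back.
C2 = (A2.astype(np.float64) @ B2.astype(np.float64)).astype(A2.dtype)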

From what I understand, both numpy and numba make use of vectorization. I wonder what is different between the two implementations that produces a fairly consistent gap of about 25% (Numba's single-threaded loop at ~1.5 s versus numpy's integer matmul at ~2 s).

I tried reversing the order of the benchmarks in case fewer CPU resources were available towards the end, and I made sure not to do anything else while the program was running.

numpy
35.1 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.97 s* ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
32 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
normal
1.48 s ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.46 s* ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.47 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
379 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
461 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
381 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Trying the method in the answer doesn’t really help.

import inspect
inspect.getmodule(matrix_multiplication)
<module '__main__'>

I tried it on Google Colab.

normal
2.28 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.7 s* ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.6 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
1.33 s ± 315 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.66 s ± 425 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 327 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
64.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*2.14 s* ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
64.1 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is possible to print the generated LLVM code, but I don't know how to compare it to the numpy implementation.

for v, k in matrix_multiplication.inspect_llvm().items():
  print(v, k)
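
Numba also has higher-level inspection hooks that can be easier to read than raw LLVM IR; for example, these standard Dispatcher methods (shown here as a sketch):

# Annotated source with the inferred type of every expression
matrix_multiplication.inspect_types()

# Final machine code per compiled signature; useful for spotting SIMD
for sig, asm in matrix_multiplication.inspect_asm().items():
    print(sig)
    print(asm[:500])  # the full listing is long, print just the start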

Going to the definition of np.matmul leads to matmul: _GUFunc_Nin2_Nout1[L['matmul'], L[19], None] in "…/site-packages/numpy/__init__.pyi".
I think this is the C function being called, given the "no BLAS" label in the source. The code seems equivalent to mine, except for additional if statements.
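
A circumstantial check that the integer product bypasses BLAS, assuming only standard NumPy:

np.show_config()        # reports which BLAS/LAPACK NumPy was built against
print((A2 @ B2).dtype)  # stays an integer dtype; BLAS gemm only handles
                        # float/complex, so this product cannot use it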

For small arrays m = n = p = 10, numpy is faster.

normal
6.6 µs ± 99.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.72 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.57 µs ± 62.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
parallel
63.5 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
64.5 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
63.3 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
numpy
1.94 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.53 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.91 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

n = 10000 instead of 1000

normal
14.4 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.3 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.7 s ± 538 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
3.34 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.46 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
334 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.4 s ± 655 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
248 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit:
Reported the issue

Asked By: BPDev


Answers:

You are comparing two different loop orders. The loop order equivalent to the Numpy implementation would look like the following:

@njit()
def matrix_multiplication_slow(A, B):
  m, n = A.shape
  _, p = B.shape
  C = np.zeros((m, p))
  for i in range(m):
    for k in range(p):
      for j in range(n):
        C[i, k] += A[i, j] * B[j, k]
  return C

The above matrix_multiplication_slow() is slower than the original matrix_multiplication() because its innermost loop reads B[j, k] while varying j, striding down a column of the row-major array B, which causes many more cache misses.
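
To check the cache-miss explanation empirically, the two loop orders can be timed side by side; a minimal sketch reusing the arrays from the question:

matrix_multiplication_slow(A2, B2)          # warm-up / compile
%timeit matrix_multiplication(A2, B2)       # i-j-k: inner loop reads B row-wise (contiguous)
%timeit matrix_multiplication_slow(A2, B2)  # i-k-j: inner loop reads B column-wise (strided)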

So, the current Numpy implementation is not cache friendly. It would be good to report this on the NumPy issue tracker.

Answered By: relent95