Fastest way to add matrices of different shapes in Python/Numba

Question:

I want to "add" two matrices, a of shape (K, T) and b of shape (K, N), to get a result of shape (K, T, N), where result[k, t, n] = a[k, t] + b[k, n].

The following works ok:


import numpy as np 
from numba import njit

@njit
def add_matrices(a, b):
    K, T, N = a.shape[0], a.shape[1], b.shape[1]
    result_matrix = np.empty((K, T, N))
    
    for k in range(K):
        for t in range(T):
            for n in range(N):
                result_matrix[k, t, n] = a[k, t] + b[k, n]
    
    return result_matrix


K = 10
T = 11
N = 12
a = np.ones((K,T))
b = np.ones((K,N))

result = add_matrices(a, b)


Is there a faster (vectorized?) way to do this without the for loops? I think they are slowing down the function, especially for larger values of K, T, N.

Asked By: user1887919


Answers:

Use broadcasting.

a[:,:,None] + b[:,None,:]

This makes a appear to have shape (K, T, 1) and b appear to have shape (K, 1, N). NumPy broadcasts the length-1 axes against each other, so the sum has shape (K, T, N).
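
To make the shapes concrete, here is a minimal sketch that checks the broadcast result against the explicit triple loop from the question. The small sizes and the names a3, b3 and expected are only for illustration.

```python
import numpy as np

K, T, N = 2, 3, 4
a = np.arange(K * T, dtype=float).reshape(K, T)
b = np.arange(K * N, dtype=float).reshape(K, N)

# Indexing with None (an alias for np.newaxis) inserts a length-1 axis.
a3 = a[:, :, None]  # shape (K, T, 1)
b3 = b[:, None, :]  # shape (K, 1, N)

# Length-1 axes are stretched to match the other operand: shape (K, T, N).
result = a3 + b3

# Reference: the explicit triple loop from the question.
expected = np.empty((K, T, N))
for k in range(K):
    for t in range(T):
        for n in range(N):
            expected[k, t, n] = a[k, t] + b[k, n]

assert result.shape == (K, T, N)
assert np.array_equal(result, expected)
```

Here a3 and b3 are views of a and b, so the only new allocation is the (K, T, N) result.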

Answered By: Frank Yellin

For me, the accepted answer is slower than the original code, even when combined with numba.njit (combining it with njit actually makes it slower for large K, T, N). If K, T, N are large, you can speed up your original code with numba.prange:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def add_matrices_prange(a, b):
    assert a.ndim == 2
    assert b.ndim == 2
    assert a.shape[0] == b.shape[0]
    K, T = a.shape
    N = b.shape[1]
    result_matrix = np.empty((K, T, N))
    for k in nb.prange(K):
        for t in range(T):
            for n in range(N):
                result_matrix[k, t, n] = a[k, t] + b[k, n]
    return result_matrix

Timings:

def yellin(a, b):
    return a[:,:,None] + b[:,None,:]

yellin_njit = nb.njit(yellin) # requires numba.__version__ >= 0.58

@nb.njit
def yellin_reshape_njit(a, b): # compatible with numba.__version__ < 0.58
    K, T = a.shape
    _, N = b.shape
    return a.reshape(K, T, 1) + b.reshape(K, 1, N)

K, T, N = 10, 11, 12
# K, T, N = 300, 400, 500

rng = np.random.default_rng()
a = rng.random((K,T))
b = rng.random((K,N))

result = add_matrices(a, b)

assert np.allclose(result, yellin(a, b))
assert np.allclose(result, yellin_njit(a, b))
assert np.allclose(result, yellin_reshape_njit(a, b))
assert np.allclose(result, add_matrices_prange(a, b))

%timeit -n 1000 add_matrices(a, b)
%timeit -n 1000 yellin(a, b)
%timeit -n 1000 yellin_njit(a, b)
%timeit -n 1000 yellin_reshape_njit(a, b)
%timeit -n 1000 add_matrices_prange(a, b)

Results:

K, T, N = 10, 11, 12

751 ns ± 3.75 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.17 µs ± 93 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.27 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.91 µs ± 39.3 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
11.6 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) # slow for small K, T, N

K, T, N = 300, 400, 500 (using %timeit -n 10)

45.8 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
57.6 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
60.9 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # njit made this slower than pure numpy
85.2 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # also slower than pure numpy
19.9 ms ± 311 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) # fastest for large arrays
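
The %timeit lines above only work in IPython or Jupyter. As a rough sketch for a plain Python script, the standard timeit module can be used instead. The helper time_call and the function broadcast_add are hypothetical names for this illustration, and the helper reports the best of several repeats rather than the mean ± std. dev. that %timeit prints, so the numbers are not directly comparable to those above.

```python
import timeit

import numpy as np


def broadcast_add(a, b):
    # Same expression as yellin() above.
    return a[:, :, None] + b[:, None, :]


def time_call(func, *args, number=100, repeat=5):
    """Return the best observed time per call, in seconds."""
    # Warm-up call: for an njit-compiled function the first call includes
    # compilation, which should not be counted in the measurement.
    func(*args)
    times = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(times) / number


rng = np.random.default_rng(0)
a = rng.random((10, 11))
b = rng.random((10, 12))

per_call = time_call(broadcast_add, a, b)
print(f"broadcast_add: {per_call * 1e6:.2f} µs per call")
```

The same helper can be pointed at add_matrices, yellin_njit or add_matrices_prange; results will vary with hardware, thread count and library versions.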
Answered By: Nin17