What is the best way to compute the trace of a matrix product in numpy?

Question:

If I have numpy arrays A and B, then I can compute the trace of their matrix product with:

tr = numpy.linalg.trace(A.dot(B))

However, the matrix multiplication A.dot(B) unnecessarily computes all of the off-diagonal entries in the matrix product, when only the diagonal elements are used in the trace. Instead, I could do something like:

tr = 0.0
for i in range(n):
    tr += A[i, :].dot(B[:, i])

but this performs the loop in Python code and isn’t as obvious as numpy.linalg.trace.

Is there a better way to compute the trace of a matrix product of numpy arrays? What is the fastest or most idiomatic way to do this?

Asked By: amcnabb


Answers:

From Wikipedia, you can calculate the trace using the Hadamard product (element-wise multiplication):

# Tr(A.B)
tr = (A*B.T).sum()

I think this takes less computation than doing numpy.trace(A.dot(B)).
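This works because tr(AB) = Σᵢⱼ AᵢⱼBⱼᵢ, which is exactly the sum of the entrywise product of A and B.T. A quick sanity check (a minimal sketch, using small random matrices):

import numpy as np

A = np.random.rand(5, 5)
B = np.random.rand(5, 5)

# The trace of the product equals the sum of the Hadamard product A * B.T
print(np.allclose(np.trace(A.dot(B)), (A * B.T).sum()))  # True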

Edit:

Ran some timers. This way is much faster than using numpy.trace.

In [36]: from timeit import timeit

In [37]: timeit("np.trace(A.dot(B))", setup="""import numpy as np;
A, B = np.random.rand(1000, 1000), np.random.rand(1000, 1000)""", number=100)
Out[37]: 8.6434469223022461

In [38]: timeit("(A*B.T).sum()", setup="""import numpy as np;
A, B = np.random.rand(1000, 1000), np.random.rand(1000, 1000)""", number=100)
Out[38]: 0.5516049861907959
Answered By: wflynny

You can improve on @wflynny's solution by reducing the intermediate storage to the diagonal elements only:

import numpy as np
# numpy.core.umath_tests is a private module; it was deprecated in
# NumPy 1.16 and is gone from current releases.
from numpy.core.umath_tests import inner1d

m, n = 1000, 500

a = np.random.rand(m, n)
b = np.random.rand(n, m)

# They should all give the same result
print(np.trace(a.dot(b)))
print(np.sum(a * b.T))
print(np.sum(inner1d(a, b.T)))

%timeit np.trace(a.dot(b))
10 loops, best of 3: 34.7 ms per loop

%timeit np.sum(a*b.T)
100 loops, best of 3: 4.85 ms per loop

%timeit np.sum(inner1d(a, b.T))
1000 loops, best of 3: 1.83 ms per loop

Another option is to use np.einsum and have no explicit intermediate storage at all:

# Will print the same as the others:
print(np.einsum('ij,ji->', a, b))

On my system it runs slightly slower than using inner1d, but that may not hold for all systems; see this question:

%timeit np.einsum('ij,ji->', a, b)
100 loops, best of 3: 1.91 ms per loop
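Since numpy.core.umath_tests is not available in current NumPy releases, a sketch of an equivalent of the inner1d variant using einsum, which also keeps only the diagonal as an intermediate:

import numpy as np

m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)

# The i-th entry is the dot product of row i of a with column i of b,
# i.e. the diagonal of a.dot(b); summing it gives the trace.
tr = np.einsum('ij,ji->i', a, b).sum()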
Answered By: Jaime

Note that one slight variant is to take the dot product of the vectorized matrices. In NumPy, vectorization (the vec operator, which stacks the columns of a matrix into a single vector) is done with .flatten('F'). On my computer it is slightly slower than taking the sum of the Hadamard product, so it's a worse solution than wflynny's, but I think it's interesting because it can be more intuitive in some situations. For example, I personally find that for the matrix normal distribution, the vectorized solution is easier to understand.
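Concretely, the identity behind this variant (writing vec for the column-stacking operator that .flatten('F') computes) is

\operatorname{tr}(AB) = \sum_{i,j} A_{ij} B_{ji} = \operatorname{vec}(A)^\top \operatorname{vec}(B^\top)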

Speed comparison, on my system:

import numpy as np
import time

N = 1000

np.random.seed(123)
A = np.random.randn(N, N)
B = np.random.randn(N, N)

start = time.time()
for i in range(10):
    C = np.trace(A.dot(B))
print(time.time() - start, C)

start = time.time()
for i in range(10):
    C = A.flatten('F').dot(B.T.flatten('F'))
print(time.time() - start, C)

start = time.time()
for i in range(10):
    C = (A.T * B).sum()
print(time.time() - start, C)

start = time.time()
for i in range(10):
    C = (A * B.T).sum()
print(time.time() - start, C)

Result:

6.246593236923218 -629.370798672
0.06539678573608398 -629.370798672
0.057890892028808594 -629.370798672
0.05709719657897949 -629.370798672
Answered By: Hugh Perkins

I’ve found a new solution based on @wflynny’s and @Jaime’s answers. It directly computes the inner product between "flattened" versions of the two matrices.

np.dot(a.ravel(), b.ravel())

It turned out to be about 25 times faster than @wflynny’s solution and 13 times faster than @Jaime’s.

In [1]: from numpy.core.umath_tests import inner1d
    ...: 
    ...: m, n = 1000, 500
    ...: 
    ...: a = np.random.rand(m, n)
    ...: b = np.random.rand(n, m)
    ...:

In [2]: %timeit np.trace(a.dot(b))
5.73 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit np.sum(a*b.T)
1.12 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit np.sum(inner1d(a, b.T))
594 µs ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit np.einsum('ij,ji->', a, b)
587 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit np.dot(a.ravel(), b.ravel())
43.3 µs ± 2.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

However, it yields a slightly different result (about 0.09% relative error) from the other methods. The reason is that np.dot(a.ravel(), b.ravel()) pairs the entries of a and b in flat C order: for square matrices that computes tr(ABᵀ) rather than tr(AB), and here, with a of shape (m, n) and b of shape (n, m), it pairs mismatched entries entirely. The correct flattened form is np.dot(a.ravel(), b.T.ravel()), although raveling b.T forces a contiguous copy, which gives back some of the speedup (see the sketch after the printout below).

In [7]: print(np.trace(a.dot(b)))
    ...: print(np.sum(a*b.T))
    ...: print(np.sum(inner1d(a, b.T)))
    ...: print(np.einsum('ij,ji->', a, b))
    ...: print(np.dot(a.ravel(), b.ravel()))
    ...: 
124821.25304563068
124821.25304563066
124821.25304563067
124821.2530456306
124935.68288955501
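A minimal sketch of the corrected flattened version, which agrees with the other methods (assuming the same a and b as above; b.T.ravel() copies because b.T is a non-contiguous view, so expect it to be somewhat slower than the raw ravel timing):

import numpy as np

m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)

# tr(AB) = sum_ij a[i, j] * b[j, i]; raveling b.T lines the entries up
# correctly, at the cost of a contiguous copy of b.T.
tr = np.dot(a.ravel(), b.T.ravel())
print(np.isclose(tr, np.trace(a.dot(b))))  # True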
Answered By: Ryota Ushio