What is the best way to compute the trace of a matrix product in numpy?
Question:
If I have numpy arrays A and B, then I can compute the trace of their matrix product with:
tr = numpy.linalg.trace(A.dot(B))
However, the matrix multiplication A.dot(B) unnecessarily computes all of the off-diagonal entries in the matrix product, when only the diagonal elements are used in the trace. Instead, I could do something like:
tr = 0.0
for i in range(n):
    tr += A[i, :].dot(B[:, i])
but this performs the loop in Python code and isn’t as obvious as numpy.linalg.trace.
Is there a better way to compute the trace of a matrix product of numpy arrays? What is the fastest or most idiomatic way to do this?
Answers:
According to Wikipedia, you can calculate the trace using the Hadamard product (element-wise multiplication):
# Tr(A.B)
tr = (A*B.T).sum()
I think this takes less computation than doing numpy.trace(A.dot(B)).
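The identity behind this can be written out as a short derivation. This is a sketch for A of shape m×n and B of shape n×m; the symbol \circ for the Hadamard product is notation introduced here, not taken from the answer.

```latex
\operatorname{Tr}(AB)
  = \sum_{i=1}^{m} (AB)_{ii}
  = \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ji}
  = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( A \circ B^{\mathsf{T}} \right)_{ij}
```

Only the m·n products A_ij B_ji are needed, whereas forming AB first takes m·m·n multiplications.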
Edit:
Ran some timers. This way is much faster than using numpy.trace.
In [37]: timeit("np.trace(A.dot(B))", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[38]: 8.6434469223022461
In [39]: timeit("(A*B.T).sum()", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[40]: 0.5516049861907959
You can improve on @Bill’s solution by reducing intermediate storage to the diagonal elements only:
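import numpy as np
# inner1d lives in an internal NumPy test module and may not be importable in recent releases
from numpy.core.umath_tests import inner1d
m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)
# They all should give the same result
print(np.trace(a.dot(b)))
print(np.sum(a*b.T))
print(np.sum(inner1d(a, b.T)))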
%timeit np.trace(a.dot(b))
10 loops, best of 3: 34.7 ms per loop
%timeit np.sum(a*b.T)
100 loops, best of 3: 4.85 ms per loop
%timeit np.sum(inner1d(a, b.T))
1000 loops, best of 3: 1.83 ms per loop
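Since inner1d comes from an internal test module, here is a minimal sketch of the same row-wise inner product written with np.einsum, assuming inner1d is not available. The sizes and the seed are arbitrary choices for illustration, and no timing claim is made for this variant.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for the illustration
m, n = 6, 4                      # small hypothetical sizes
a = rng.random((m, n))
b = rng.random((n, m))

# Row-wise inner products of a and b.T: entry i is sum_j a[i, j] * b[j, i],
# which is the i-th diagonal element of a.dot(b).
diag = np.einsum('ij,ij->i', a, b.T)

# Summing the diagonal gives the trace without forming the full product.
tr = diag.sum()
```

Only an array of length m is stored as an intermediate, which is the point of this answer.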
Another option is to use np.einsum and have no explicit intermediate storage at all:
# Will print the same as the others:
print(np.einsum('ij,ji->', a, b))
On my system it runs slightly slower than using inner1d, but that may not hold for all systems, see this question:
%timeit np.einsum('ij,ji->', a, b)
100 loops, best of 3: 1.91 ms per loop
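To make the subscript string concrete, here is a small sketch that wraps the einsum call in a helper with a shape check. The name trace_of_product is hypothetical and not part of NumPy.

```python
import numpy as np

def trace_of_product(a, b):
    """Hypothetical helper: trace of a @ b without forming the product."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.ndim != 2 or b.ndim != 2 or a.shape != b.shape[::-1]:
        raise ValueError("expected a of shape (m, n) and b of shape (n, m)")
    # 'ij,ji->' multiplies a[i, j] by b[j, i] and sums over both indices;
    # the empty output after '->' means the result is a scalar.
    return np.einsum('ij,ji->', a, b)

a = np.arange(6.0).reshape(2, 3)
b = np.arange(6.0).reshape(3, 2)
print(trace_of_product(a, b), np.trace(a @ b))  # both values are 50.0
```

The shape check is a design choice for the sketch: einsum itself would raise on mismatched dimensions, but with a less direct message.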
Note that one slight variant is to take the dot product of the vectorized matrices. In NumPy, vectorization (stacking the columns of a matrix into one vector) is done using .flatten('F'). On my computer it’s slightly slower than taking the sum of the Hadamard product, so it’s a worse solution than wflynny’s, but I think it’s interesting because it can be more intuitive in some situations. For example, I find that for the matrix normal distribution the vectorized form is easier to understand.
Speed comparison, on my system:
import numpy as np
import time
N = 1000
np.random.seed(123)
A = np.random.randn(N, N)
B = np.random.randn(N, N)
# Full matrix product, then trace
start = time.time()
for i in range(10):
    C = np.trace(A.dot(B))
print(time.time() - start, C)
# Dot product of the vectorized matrices
start = time.time()
for i in range(10):
    C = A.flatten('F').dot(B.T.flatten('F'))
print(time.time() - start, C)
# Sum of the Hadamard product, transposing A
start = time.time()
for i in range(10):
    C = (A.T * B).sum()
print(time.time() - start, C)
# Sum of the Hadamard product, transposing B
start = time.time()
for i in range(10):
    C = (A * B.T).sum()
print(time.time() - start, C)
Result:
6.246593236923218 -629.370798672
0.06539678573608398 -629.370798672
0.057890892028808594 -629.370798672
0.05709719657897949 -629.370798672
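The vectorized form relies on the identity Tr(AB) = vec(A)ᵀ vec(Bᵀ), where vec stacks columns. Below is a minimal check of that identity on non-square inputs; the shapes and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed
A = rng.standard_normal((5, 3))
B = rng.standard_normal((3, 5))

vec_A = A.flatten('F')       # columns of A stacked into one vector
vec_Bt = B.T.flatten('F')    # columns of B.T stacked into one vector

tr_vec = vec_A.dot(vec_Bt)   # vec(A)^T vec(B^T)
tr_ref = np.trace(A.dot(B))  # reference value via the full product

# Column-major flattening of B.T visits the same elements, in the same
# order, as row-major flattening of B.
same_order = np.array_equal(vec_Bt, B.flatten('C'))
```

Both operands have to be flattened so that A[i, j] lines up with B[j, i]; that alignment is what makes the dot product equal to the trace.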
I’ve found a new solution based on @wflynny and @Jaime’s answers. It’s just directly computing the inner product between "flattened" versions of two matrices.
np.dot(a.ravel(), b.ravel())
It turned out to be about 25 times faster than @wflynny’s solution and 13 times faster than @Jaime’s one.
In [1]: from numpy.core.umath_tests import inner1d
...:
...: m, n = 1000, 500
...:
...: a = np.random.rand(m, n)
...: b = np.random.rand(n, m)
...:
In [2]: %timeit np.trace(a.dot(b))
5.73 ms ± 406 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: %timeit np.sum(a*b.T)
1.12 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %timeit np.sum(inner1d(a, b.T))
594 µs ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: %timeit np.einsum('ij,ji->', a, b)
587 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: %timeit np.dot(a.ravel(), b.ravel())
43.3 µs ± 2.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
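However, it yields a slightly different result (about 0.09% relative error) from the other methods. The difference is not rounding error: the trace pairs a[i, j] with b[j, i], so the second operand has to be b.T.ravel(). With b.ravel(), a[i, j] is paired with a different element of b, so the timing above measures a different quantity.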
In [7]: print(np.trace(a.dot(b)))
...: print(np.sum(a*b.T))
...: print(np.sum(inner1d(a, b.T)))
...: print(np.einsum('ij,ji->', a, b))
...: print(np.dot(a.ravel(), b.ravel()))
...:
124821.25304563068
124821.25304563066
124821.25304563067
124821.2530456306
124935.68288955501
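A small sketch of why the flattened dot product needs a transpose on the second operand. The arrays are made up for illustration and are small enough to verify by hand.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
b = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # shape (3, 2)

reference = np.trace(a.dot(b))                  # 15.0

# Flattening both arrays in row-major order pairs a[i, j] with the
# wrong element of b, so this is not the trace.
no_transpose = np.dot(a.ravel(), b.ravel())     # 16.0

# Transposing b first pairs a[i, j] with b[j, i], as the trace requires.
with_transpose = np.dot(a.ravel(), b.T.ravel()) # 15.0
```

Note that b.T is not contiguous in row-major order, so b.T.ravel() has to make a copy; the timing measured for np.dot(a.ravel(), b.ravel()) does not carry over to the corrected expression.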