Fast tensor rotation with NumPy

Question:

At the heart of an application (written in Python and using NumPy) I need to rotate a 4th order tensor. Actually, I need to rotate a lot of tensors many times and this is my bottleneck. My naive implementation (below) involving eight nested loops seems to be quite slow, but I cannot see a way to leverage NumPy’s matrix operations and, hopefully, speed things up. I’ve a feeling I should be using np.tensordot, but I don’t see how.

Mathematically, the elements of the rotated tensor, T’, are given by

T’_ijkl = Σ_abcd g_ia g_jb g_kc g_ld T_abcd

with the sum running over the repeated indices a, b, c and d on the right-hand side. T and Tprime are 3×3×3×3 NumPy arrays and the rotation matrix g is a 3×3 NumPy array. My slow implementation (taking ~0.04 seconds per call) is below.

#!/usr/bin/env python

import numpy as np

def rotT(T, g):
    Tprime = np.zeros((3,3,3,3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii,i]*g[jj,j]*g[kk,k]*g[ll,l]
                                    Tprime[i,j,k,l] = Tprime[i,j,k,l] + \
                                         gg*T[ii,jj,kk,ll]
    return Tprime

if __name__ == "__main__":

    T = np.array([[[[  4.66533067e+01,  5.84985000e-02, -5.37671310e-01],
                    [  5.84985000e-02,  1.56722231e+01,  2.32831900e-02],
                    [ -5.37671310e-01,  2.32831900e-02,  1.33399259e+01]],
                   [[  4.60051700e-02,  1.54658176e+01,  2.19568200e-02],
                    [  1.54658176e+01, -5.18223500e-02, -1.52814920e-01],
                    [  2.19568200e-02, -1.52814920e-01, -2.43874100e-02]],
                   [[ -5.35577630e-01,  1.95558600e-02,  1.31108757e+01],
                    [  1.95558600e-02, -1.51342210e-01, -6.67615000e-03],
                    [  1.31108757e+01, -6.67615000e-03,  6.90486240e-01]]],
                  [[[  4.60051700e-02,  1.54658176e+01,  2.19568200e-02],
                    [  1.54658176e+01, -5.18223500e-02, -1.52814920e-01],
                    [  2.19568200e-02, -1.52814920e-01, -2.43874100e-02]],
                   [[  1.57414726e+01, -3.86167500e-02, -1.55971950e-01],
                    [ -3.86167500e-02,  4.65601977e+01, -3.57741000e-02],
                    [ -1.55971950e-01, -3.57741000e-02,  1.34215636e+01]],
                   [[  2.58256300e-02, -1.49072770e-01, -7.38843000e-03],
                    [ -1.49072770e-01, -3.63410500e-02,  1.32039847e+01],
                    [ -7.38843000e-03,  1.32039847e+01,  1.38172700e-02]]],
                  [[[ -5.35577630e-01,  1.95558600e-02,  1.31108757e+01],
                    [  1.95558600e-02, -1.51342210e-01, -6.67615000e-03],
                    [  1.31108757e+01, -6.67615000e-03,  6.90486240e-01]],
                   [[  2.58256300e-02, -1.49072770e-01, -7.38843000e-03],
                    [ -1.49072770e-01, -3.63410500e-02,  1.32039847e+01],
                    [ -7.38843000e-03,  1.32039847e+01,  1.38172700e-02]],
                   [[  1.33639532e+01, -1.26331100e-02,  6.84650400e-01],
                    [ -1.26331100e-02,  1.34222177e+01,  1.67851800e-02],
                    [  6.84650400e-01,  1.67851800e-02,  4.89151396e+01]]]])

    g = np.array([[ 0.79389393,  0.54184237,  0.27593346],
                  [-0.59925749,  0.62028664,  0.50609776],
                  [ 0.10306737, -0.56714313,  0.8171449 ]])

    for i in range(100):
        Tprime = rotT(T,g)

Is there a way to make this go faster? Making the code generalise to other ranks of tensor would be useful, but is less important.

Asked By: Andrew Walker

Answers:

Here is how to do it with a single Python loop:

def rotT(T, g):
    Tprime = T
    for i in range(4):
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2   # real axes at positions i and i + 4, length-1 axes elsewhere
        Tprime = g[tuple(slices)].T * Tprime  # index with a tuple; NumPy no longer accepts a list here
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)

Admittedly, this is a bit hard to grasp at first glance, but it’s quite a bit faster 🙂
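
To make the indexing trick more concrete, here is a small sketch of the broadcast shapes it produces on the first iteration:

import numpy as np

g = np.eye(3)  # any 3x3 rotation matrix

# i = 0: real axes at positions 0 and 4, length-1 axes elsewhere
slices = [None] * 4
slices[0] = slice(None)
slices *= 2
print(g[tuple(slices)].shape)    # (3, 1, 1, 1, 3, 1, 1, 1)
print(g[tuple(slices)].T.shape)  # (1, 1, 1, 3, 1, 1, 1, 3)

After the four multiplications, Tprime is an eight-dimensional array carrying all four g factors, and the four trailing axes are then summed away.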

Answered By: Sven Marnach

To use tensordot, compute the outer product of the g tensors:

def rotT(T, g):
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)

On my system, this is around seven times faster than Sven’s solution. If the g tensor doesn’t change often, you can also cache the gggg tensor. If you do that and apply some micro-optimizations (inlining the tensordot code, no checks, no generic shapes), you can make it another two times faster:

def rotT(T, gggg):
    return np.dot(gggg.transpose((1, 3, 5, 7, 0, 2, 4, 6)).reshape((81, 81)),
                  T.reshape(81, 1)).reshape((3, 3, 3, 3))
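
For completeness, here is a sketch of how the cached variant would be used; many_tensors is a hypothetical iterable of 3×3×3×3 arrays:

gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)  # built once per rotation matrix
for T in many_tensors:                        # hypothetical stream of tensors
    Tprime = rotT(T, gggg)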

Results of timeit on my home laptop (500 iterations):

Your original code: 19.471129179
Sven's code: 0.718412876129
My first code: 0.118047952652
My second code: 0.0690279006958

The numbers on my work machine are:

Your original code: 9.77922987938
Sven's code: 0.137110948563
My first code: 0.0569641590118
My second code: 0.0308079719543

Answered By: Philipp

Out of curiosity I’ve compared a Cython implementation of the naive code from the question with the NumPy code from @Philipp’s answer. The Cython code is four times faster on my machine:

#cython: boundscheck=False, wraparound=False
import numpy as np
cimport numpy as np

def rotT(np.ndarray[np.float64_t, ndim=4] T,
         np.ndarray[np.float64_t, ndim=2] g):
    cdef np.ndarray[np.float64_t, ndim=4] Tprime
    cdef Py_ssize_t i, j, k, l, ii, jj, kk, ll
    cdef np.float64_t gg

    Tprime = np.zeros((3,3,3,3), dtype=T.dtype)
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for ii in range(3):
                        for jj in range(3):
                            for kk in range(3):
                                for ll in range(3):
                                    gg = g[ii,i]*g[jj,j]*g[kk,k]*g[ll,l]
                                    Tprime[i,j,k,l] = Tprime[i,j,k,l] + \
                                         gg*T[ii,jj,kk,ll]
    return Tprime
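
One way to build and call this is via pyximport; this is a sketch that assumes the code above is saved as rott_cython.pyx (a hypothetical filename) and that T and g are defined as in the question:

import numpy as np
import pyximport
pyximport.install(setup_args={'include_dirs': np.get_include()})

import rott_cython  # compiled on first import
Tprime = rott_cython.rotT(T, g)
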
Answered By: jfs

Thanks to hard work by M. Wiebe, the next version of NumPy (which will probably be 1.6) is going to make this even easier:

>>> Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

Philipp’s approach is at the moment 3x faster, but perhaps there is some room for improvement. The speed difference is probably mostly due to tensordot being able to unroll the whole operation into a single matrix product that can be passed to BLAS, thereby avoiding much of the overhead associated with small arrays. This is not possible for general Einstein summation, as not all operations that can be expressed in this form resolve to a single matrix product.
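
As a quick sanity check, the einsum expression can be compared against Philipp’s tensordot formulation on random inputs:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.random.rand(3, 3)

Trot = np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

gg = np.outer(g, g)
gggg = np.outer(gg, gg).reshape(4 * g.shape)
assert np.allclose(Trot, np.tensordot(gggg, T, axes=((0, 2, 4, 6), (0, 1, 2, 3))))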

Answered By: pv.

I thought I’d contribute a relatively new data point to these benchmarks, using parakeet, one of the numpy-aware JIT compilers that have sprung up in the past few months. (The other one I’m aware of is numba, but I didn’t test it here.)

After you make it through the somewhat labyrinthine installation process for LLVM, you can decorate many pure-numpy functions to (often) speed up their performance:

import numpy as np
import parakeet

@parakeet.jit
def rotT(T, g):
    # ... (same body as the original rotT in the question)

I only tried applying the JIT to Andrew’s code from the original question, but it does pretty well (> 10x speedup) for not having to write any new code whatsoever:

andrew      10 loops, best of 3: 206 msec per loop
andrew_jit  10 loops, best of 3: 13.3 msec per loop
sven        100 loops, best of 3: 2.39 msec per loop
philipp     1000 loops, best of 3: 0.879 msec per loop

For these timings (on my laptop) I ran each function ten times, to give the JIT a chance to identify and optimize the hot code paths.

Interestingly, Sven’s and Philipp’s suggestions are still substantially faster!

Answered By: lmjohns3

Prospective Approach and solution code

For memory efficiency, and thereby performance, we can perform the tensor matrix-multiplications in steps.

To illustrate the steps involved, let’s use the simplest of the solutions with np.einsum by @pv.

np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

As seen, the first dimension of each of the four variants of g is summed away in the tensor multiplication with T.

Let’s do those sum-reductions one tensor contraction at a time, starting with the first variant of g and T:

p1 = np.einsum('abcd, ai->bcdi', T, g)

Thus, we end up with a tensor whose dimensions, in string notation, are bcdi. The next steps sum-reduce this tensor against the remaining three g variants as used in the original einsum implementation. Hence, the next reduction would be –

p2 = np.einsum('bcdi, bj->cdij', p1, g)

As seen, we have now summed away the first two dimensions, a and b. We continue for two more steps to get rid of c and d too and are left with ijkl as the final output, like so –

p3 = np.einsum('cdij, ck->dijk', p2, g)

p4 = np.einsum('dijk, dl->ijkl', p3, g)

Now, we could use np.tensordot for these sum-reductions, which would be much more efficient.

Final implementation

Thus, porting over to np.tensordot, we would have the final implementation like so –

p1 = np.tensordot(T, g, axes=(0, 0))
p2 = np.tensordot(p1, g, axes=(0, 0))
p3 = np.tensordot(p2, g, axes=(0, 0))
out = np.tensordot(p3, g, axes=(0, 0))
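
Each tensordot consumes the leading axis of the running result (a, then b, then c, then d) and appends the surviving axis of g at the end, which is why contracting axis 0 four times leaves the output axes in exactly the order ijkl. A quick sanity check against the einsum version, on random inputs:

import numpy as np

T = np.random.rand(3, 3, 3, 3)
g = np.random.rand(3, 3)

p1 = np.tensordot(T, g, axes=(0, 0))    # bcdi
p2 = np.tensordot(p1, g, axes=(0, 0))   # cdij
p3 = np.tensordot(p2, g, axes=(0, 0))   # dijk
out = np.tensordot(p3, g, axes=(0, 0))  # ijkl

assert np.allclose(out, np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T))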

Runtime test

Let’s test the performance of all the NumPy-based approaches posted in the other answers.

Approaches as functions –

def rotT_Philipp(T, g):  # @Philipp's soln
    gg = np.outer(g, g)
    gggg = np.outer(gg, gg).reshape(4 * g.shape)
    axes = ((0, 2, 4, 6), (0, 1, 2, 3))
    return np.tensordot(gggg, T, axes)

def rotT_Sven(T, g):    # @Sven Marnach's soln
    Tprime = T
    for i in range(4):
        slices = [None] * 4
        slices[i] = slice(None)
        slices *= 2
        Tprime = g[tuple(slices)].T * Tprime
    return Tprime.sum(-1).sum(-1).sum(-1).sum(-1)    

def rotT_pv(T, g):     # @pv.'s soln
    return np.einsum('ai,bj,ck,dl,abcd->ijkl', g, g, g, g, T)

def rotT_Divakar(T, g): # Posted in this post
    p1 = np.tensordot(T, g, axes=(0, 0))
    p2 = np.tensordot(p1, g, axes=(0, 0))
    p3 = np.tensordot(p2, g, axes=(0, 0))
    p4 = np.tensordot(p3, g, axes=(0, 0))
    return p4

Timings with the original dataset sizes –

In [304]: # Setup inputs 
     ...: T = np.random.rand(3,3,3,3)
     ...: g = np.random.rand(3,3)
     ...: 

In [305]: %timeit rotT(T, g)
     ...: %timeit rotT_pv(T, g)
     ...: %timeit rotT_Sven(T, g)
     ...: %timeit rotT_Philipp(T, g)
     ...: %timeit rotT_Divakar(T, g)
     ...: 
100 loops, best of 3: 6.51 ms per loop
1000 loops, best of 3: 247 µs per loop
10000 loops, best of 3: 137 µs per loop
10000 loops, best of 3: 41.6 µs per loop
10000 loops, best of 3: 28.3 µs per loop

In [306]: 6510.0/28.3 # Speedup with the proposed soln over original code
Out[306]: 230.03533568904592

As discussed at the start of this post, we are trying to achieve memory efficiency and, with it, a performance boost. Let’s test that out as we increase the dataset sizes –

In [307]: # Setup inputs 
     ...: T = np.random.rand(5,5,5,5)
     ...: g = np.random.rand(5,5)
     ...: 

In [308]: %timeit rotT(T, g)
     ...: %timeit rotT_pv(T, g)
     ...: %timeit rotT_Sven(T, g)
     ...: %timeit rotT_Philipp(T, g)
     ...: %timeit rotT_Divakar(T, g)
     ...: 
100 loops, best of 3: 6.54 ms per loop
100 loops, best of 3: 7.17 ms per loop
100 loops, best of 3: 2.7 ms per loop
1000 loops, best of 3: 1.47 ms per loop
10000 loops, best of 3: 39.9 µs per loop
Answered By: Divakar

Not a new answer, as all the previous ones deal well with the question. More like a comment, but I post it as an answer to have some space for the code.

While all answers do reproduce the result of the original post, I am pretty sure that the code provided in the original post is wrong. Looking at the formula T’_ijkl = Σ_abcd g_ia g_jb g_kc g_ld T_abcd, which I believe is correct, the indices of g that are varied in the calculation of each entry of T’ are a, b, c and d. However, in the code provided in the original post, the indices used to access the values of g in the calculation of gg are swapped with respect to the formula. Hence, I believe the following code actually provides the correct implementation of the formula:

def rotT(T, g):
    Tprime = np.zeros((3, 3, 3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for a in range(3):
                        for b in range(3):
                            for c in range(3):
                                for d in range(3):
                                    Tprime[i, j, k, l] += (g[i, a] * g[j, b] *
                                                           g[k, c] * g[l, d] *
                                                           T[a, b, c, d])
    return Tprime

The equivalent, but faster, calls to einsum and tensordot update to:

Tprime = np.tensordot(g, np.tensordot(g, np.tensordot(
    g, np.tensordot(g, T, (1, 3)), (1, 3)), (1, 3)), (1, 3))
Tprime = np.einsum('ia, jb, kc, ld, abcd->ijkl', g, g, g, g, T)

Additionally, using @jit(nopython=True) from numba on the naive loops function is five times faster than using numpy.tensordot on my machine.
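
For reference, here is a minimal sketch of that numba variant, assuming numba is installed; the body is the corrected loop from above:

import numpy as np
from numba import jit

@jit(nopython=True)
def rotT_numba(T, g):
    Tprime = np.zeros((3, 3, 3, 3))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                for l in range(3):
                    for a in range(3):
                        for b in range(3):
                            for c in range(3):
                                for d in range(3):
                                    Tprime[i, j, k, l] += (g[i, a] * g[j, b] *
                                                           g[k, c] * g[l, d] *
                                                           T[a, b, c, d])
    return Tprime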

Answered By: Patol75