Why is the list comprehension version slower (Python)? Are there better ways?

Question:

I have a piece of code, and I'm surprised that the list-comprehension version is slower. I want to calculate "average sphere distances".

The steps needed to be done:

  1. choose a node
  2. choose all elements at distance "d" around it (the shell at distance d)
  3. find the shells of radius "d" around all nodes at distance "d"
  4. calculate the average distance of the shells with radius "d" at distance "d"

I'm using a sparse matrix to store the adjacency relation between the nodes. As these are Laplacians, I have actually removed the diagonal, but that's irrelevant here. I'm using scipy's shortest_path (or now actually scipy.sparse.csgraph.dijkstra) to calculate the shortest paths between nodes. For a node "i" the return value is an array of distances to all elements up to the limit (if passed as an argument to dijkstra). Since I use a subset of the elements from the matrix, I need a dictionary that can translate between an element and its index in the list of shortest paths.
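The element↔index translation described above can be sketched on a toy subset (the names mirror the ones used below; the values are hypothetical):

```python
import numpy as np

ball = [4, 7, 9]  # hypothetical subset of node indices

# element -> row index in the shortest-path array, and the inverse
dict_ba = dict(zip(ball, np.arange(len(ball))))
dict_ab = dict(zip(np.arange(len(ball)), ball))
```

So `dict_ba[7]` gives the row of node 7 in `spaths`, and `dict_ab[1]` recovers the node stored in row 1.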

So the code:

I’m importing the libraries:

import numpy as np
import scipy as scp
from scipy import io      # Matrix Market reader (mmread)
from scipy import sparse  # package for sparse matrices
from scipy.sparse import identity, csr_matrix
from scipy.sparse.csgraph import shortest_path, dijkstra

Reading in a sparse matrix, with some parameters (nxn):

input_m = "L-4-4.0-0.02-4-2-1.mtx"
L = scp.sparse.csc_matrix(scp.io.mmread(input_m), dtype=int)
ID = identity(np.shape(L)[0], dtype='int8', format='dia')
WA = abs(5*ID - L)

Define the functions: one returns the shells (all shells up to distance n) and the ball containing all their elements; the other returns only the shell at distance n:

def GetShellBall(WA, n, ind):
    p0 = np.zeros(WA.shape[0])  # use WA, not the global L
    p0[ind] = 1
    newp = p0
    ball = [ind]      # all nodes within distance n of ind
    shell = [[ind]]   # shell[d] = nodes at exactly distance d
    for it in range(n):
        newp = WA @ newp
        for it2 in np.where(newp)[0]:
            if it2 in ball:        # O(len(ball)) membership test on a list
                newp[it2] = 0      # drop nodes already visited
            else:
                ball.append(it2)
        shell.append(np.where(newp)[0])

    return ball, shell


def GetShellj(WA, n, ind):
    p0 = np.zeros(WA.shape[0])  # use WA, not the global L
    p0[ind] = 1
    newp = p0
    ball = [ind]
    shell = [[ind]]
    for it in range(n):
        newp = WA @ newp
        for it2 in np.where(newp)[0]:
            if it2 in ball:
                newp[it2] = 0
            else:
                ball.append(it2)
        shell.append(np.where(newp)[0])

    return shell[n]  # only the shell at exactly distance n
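As an aside, GetShellj repeats the body of GetShellBall, and the `it2 in ball` test scans a list on every hit. A deduplicated sketch (hypothetical name `get_shells`, same shells as GetShellBall) using a set for O(1) membership tests:

```python
import numpy as np

def get_shells(WA, n, ind):
    """BFS ball and shells around `ind` up to distance n: same output
    as GetShellBall, but deduplicating with a set."""
    p = np.zeros(WA.shape[0])
    p[ind] = 1
    ball = {ind}
    shells = [[ind]]
    for _ in range(n):
        p = WA @ p
        frontier = [j for j in np.where(p)[0] if j not in ball]
        ball.update(frontier)
        p = np.zeros_like(p)
        p[frontier] = 1  # keep only the new frontier for the next step
        shells.append(frontier)
    return ball, shells

# toy check on a path graph 0-1-2-3
WA4 = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
ball, shells = get_shells(WA4, 2, 0)
```

For large balls the set makes the membership test constant-time instead of linear in the ball size.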

Create a shell and ball around a node at a given maximum distance using Dijkstra's algorithm, plus the dictionaries that map between elements of the ball and row indices in spaths (the array of shortest paths):

%%time
it = 0
N = 11
ball, shell = GetShellBall(WA, N, it)  # shell and ball around `it` up to distance N

dict_ba = dict(zip(ball, np.arange(len(ball))))  # ball element -> row index in spaths
dict_ab = dict(zip(np.arange(len(ball)), ball))  # row index in spaths -> ball element

spaths = np.asarray([dijkstra(csgraph=WA, directed=True,
                              limit=N, unweighted=True, indices=node,
                              return_predecessors=False) for node in ball])

CPU times: user 7.61 s, sys: 11.9 ms, total: 7.62 s
Wall time: 7.63 s

NOW comes the "heavy part": calculate the average distances around all nodes at distance d from my original node. So for node el_i with its shell at distance i, get the shells of radius "i" around all elements of that shell and calculate the average distance between the two shells; then average over them. For this code the loop version is faster than the vectorized ones:

%%timeit
#for i in range(len(shell)):
DD = []
for i in range(N // 2):
    sumd = []
    chosen_paths = spaths[[dict_ba[it] for it in shell[i]]]
    for eli in shell[i]:
        shellj = GetShellj(WA, i, eli)

        sumd = [[chp[elj] for elj in shellj] for chp in chosen_paths]

    DD.append(np.mean(sumd))

36.2 ms ± 943 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
#for i in range(len(shell)):
DD = []
for i in range(N // 2):
    sumd = []
    chosen_paths = spaths[[dict_ba[it] for it in shell[i]]]
    for eli in shell[i]:

        sumd = [[chp[elj] for elj in GetShellj(WA, i, eli)] for chp in chosen_paths]

    DD.append(np.mean(sumd))

422 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
#for i in range(len(shell)):
DD = []
for i in range(N // 2):
    chosen_paths = spaths[[dict_ba[it] for it in shell[i]]]

    sumd = [[[chp[elj] for elj in GetShellj(WA, i, eli)] for chp in chosen_paths] for eli in shell[i]]

    DD.append(np.mean(sumd))

439 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
#for i in range(len(shell)):
DD = []
for i in range(N // 2):
    sumd = [
        [
            [
                chp[elj] for elj in GetShellj(WA, i, eli)
            ] for chp in spaths[[dict_ba[it] for it in shell[i]]]
        ] for eli in shell[i]]

    DD.append(np.mean(sumd))

445 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
#for i in range(len(shell)):

DD = [np.mean([
    [
        [
            chp[elj] for elj in GetShellj(WA, i, eli)
        ] for chp in spaths[[dict_ba[it] for it in shell[i]]]
    ] for eli in shell[i]]) for i in range(N // 2)]

440 ms ± 8.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is the loop version so much faster? I would like to scale this up. This is just one starting point in a 2300×2300 matrix, but I need to go much higher (~half-million × half-million matrices).

Are there methods to accelerate it even further (apart from using C instead of Python)?

EDIT:
Solved: bug.

Question remains: is there a faster way?

Asked By: Kregnach


Answers:

The first version is faster than the rest because it doesn't call GetShellj(WA,i,eli) as many times: in the comprehension versions, the inner iterable GetShellj(WA,i,eli) is re-evaluated once per chp of the outer loop, instead of once per eli.

If you slap the @cache or @lru_cache decorator on that function, all of the approaches should be approximately as fast.
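One wrinkle with that: scipy sparse matrices (like numpy arrays) are not hashable, so @lru_cache cannot take WA as an argument directly. A sketch that closes over WA and caches on (n, ind) alone (hypothetical factory name `make_shellj`; the inner BFS mirrors GetShellj):

```python
from functools import lru_cache

import numpy as np

def make_shellj(WA):
    """Return a cached GetShellj-style function for a fixed WA."""
    @lru_cache(maxsize=None)
    def shellj(n, ind):
        p = np.zeros(WA.shape[0])
        p[ind] = 1
        ball = {ind}
        shell = [ind]
        for _ in range(n):
            p = WA @ p
            shell = [j for j in np.where(p)[0] if j not in ball]
            ball.update(shell)
            newp = np.zeros_like(p)
            newp[shell] = 1
            p = newp
        return tuple(shell)  # immutable, safe to hand out from the cache
    return shellj

# toy usage on a path graph 0-1-2-3
WA4 = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
shellj = make_shellj(WA4)
```

With `shellj = make_shellj(WA)`, every repeated `GetShellj(WA, i, eli)` call becomes a `shellj(i, eli)` dictionary lookup after the first evaluation, which is exactly what the comprehension versions need.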

Answered By: AKX