I need help speeding up a Python for-loop with a huge number of calculations

Question:

I am working on a piece of Python software that needs to run a huge number of calculations: up to hundreds of millions or more (n in the code below can be 100000 or more). I realize that Python is not the optimal language for this work, but I have no experience with C or C++. Is there a way to speed up the code below in Python, or do I need to introduce C or C++? If I do, any suggestions for how to embed it in a Python script?

import math
import random

a = []
b = []
c = []
x1 = []
x2 = []
y1 = []
y2 = []
tresh = 30 # int or float

n=1000 # n can be up to 100000 or even more
for i in range(n):
    x1.append(random.randint(0, 100))
    x2.append(random.randint(0, 100))
    y1.append(random.randint(0, 100))
    y2.append(random.randint(0, 100))

def calc():
    x1_len = len(x1)
    y1_len = len(y1)

    for n in range(x1_len):
        for m in range(y1_len):
            d = math.sqrt((abs(y1[m] - x1[n])) ** 2 + (abs(y2[m] - x2[n])) ** 2)

            if d <= tresh:  # keep only pairs within the threshold distance
                a.append((y1[m] + x1[n]) / 2)
                b.append((y2[m] + x2[n]) / 2)
                c.append(d)

    return a,b,c

calc()

With my current Python knowledge and experience I don’t know how to optimize the code further. I have reviewed a lot of for-loop related questions but have not found anything that helped me.

Asked By: Per Helge Semb


Answers:

Can you use external Python modules like NumPy? It is the fundamental package for numerical computing in Python. Its array operations run in compiled code, so replacing the inner loop with vectorized operations should speed this up considerably.
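To illustrate the NumPy route, here is a minimal sketch that replaces the inner loop with broadcasting. The function name `calc_numpy` and the `chunk` parameter are my own (hypothetical) choices. The chunking over the x points is only there so that the intermediate distance matrix stays small when n is large. It assumes the intended condition is `d <= tresh`, as the other answers do.

```python
import numpy as np


def calc_numpy(x1, x2, y1, y2, tresh, chunk=1024):
    """Return midpoints (a, b) and distances (c) of all pairs with d <= tresh."""
    x1 = np.asarray(x1, dtype=np.float64)
    x2 = np.asarray(x2, dtype=np.float64)
    y1 = np.asarray(y1, dtype=np.float64)
    y2 = np.asarray(y2, dtype=np.float64)

    # Start with one empty array each so np.concatenate also works for empty input.
    a_parts, b_parts, c_parts = [np.empty(0)], [np.empty(0)], [np.empty(0)]

    # Handle the x points block by block: each block needs a temporary
    # distance matrix of shape (block size, len(y1)).
    for start in range(0, len(x1), chunk):
        bx1 = x1[start:start + chunk, None]  # column vector, shape (k, 1)
        bx2 = x2[start:start + chunk, None]
        d = np.sqrt((y1[None, :] - bx1) ** 2 + (y2[None, :] - bx2) ** 2)

        # np.nonzero returns indices in row-major order, which matches the
        # order of the original nested loops (x outer, y inner).
        rows, cols = np.nonzero(d <= tresh)
        a_parts.append((y1[cols] + bx1[rows, 0]) / 2)
        b_parts.append((y2[cols] + bx2[rows, 0]) / 2)
        c_parts.append(d[rows, cols])

    return np.concatenate(a_parts), np.concatenate(b_parts), np.concatenate(c_parts)
```

The results come back as NumPy arrays rather than lists; call `.tolist()` on them if you need plain lists. A smaller `chunk` uses less memory, a larger one means fewer Python-level iterations.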

Another thing: you could generate each list of random numbers in one go and use extend (or a list comprehension) instead of calling append once per element. That will also shave off some computation time.
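A small sketch of what generating the numbers in one go could look like, using only the standard library. Note that this only speeds up the setup, which is linear in n. The nested loop is quadratic in n and will still dominate the runtime.

```python
import random

n = 1000

# One list comprehension per list instead of one append() call per element.
x1 = [random.randint(0, 100) for _ in range(n)]

# random.choices draws k values in a single call; range(101) covers the same
# values as randint(0, 100), which includes both end points.
x2 = random.choices(range(101), k=n)

# extend() adds a whole batch to an existing list in one call.
y1 = []
y1.extend(random.choices(range(101), k=n))
y2 = []
y2.extend(random.choices(range(101), k=n))
```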

Answered By: ljaniec

A rather simple solution is to use numba to compile it to machine code:

import math
import numpy as np
from numba import njit


n = 10_000  # n can be up to 100000 or even more
x1 = np.random.randint(0, 100, (n,), dtype=np.int64)
x2 = np.random.randint(0, 100, (n,), dtype=np.int64)
y1 = np.random.randint(0, 100, (n,), dtype=np.int64)
y2 = np.random.randint(0, 100, (n,), dtype=np.int64)
tresh = 30  # int or float


@njit
def calc(x1, x2, y1, y2, tresh):
    a = []
    b = []
    c = []
    x1_len = len(x1)
    y1_len = len(y1)

    for n in range(x1_len):
        for m in range(y1_len):
            d = math.sqrt((abs(y1[m] - x1[n])) ** 2 + (abs(y2[m] - x2[n])) ** 2)

            if d <= tresh:
                a.append((y1[m] + x1[n]) / 2)
                b.append((y2[m] + x2[n]) / 2)
                c.append(d)

    return a,b,c

calc(x1,x2,y1,y2, tresh)  # warmup for njit

import time
t1 = time.time()
calc(x1, x2, y1, y2, tresh)
t2 = time.time()
print(f"took {t2-t1} seconds")

This takes only 3 seconds for 10_000 entries. If you’d like more performance than that, multithreading is possible for an extra speedup, but it’s not simple: you need Python to be the one creating the threads, because the current numba multithreading API won’t handle this properly (you cannot tell numba to use a separate list for each thread).
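For what it's worth, here is a rough sketch of the "let Python create the threads" idea. It is an illustration under assumptions, not something I have benchmarked: the names `calc_slice` and `calc_threaded` and the `workers` parameter are made up for this example. Each thread runs the compiled function on its own slice of the outer loop and builds its own lists, and `nogil=True` lets the threads actually run in parallel.

```python
import math
from concurrent.futures import ThreadPoolExecutor

import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: without numba, fall back to a no-op
    # decorator so the code still runs (just without any speedup).
    def njit(**kwargs):
        return lambda func: func


@njit(nogil=True)  # release the GIL while the compiled code runs
def calc_slice(x1, x2, y1, y2, tresh, start, stop):
    # Each call builds its own lists, so threads never share a list.
    a = []
    b = []
    c = []
    for n in range(start, stop):
        for m in range(len(y1)):
            d = math.sqrt((y1[m] - x1[n]) ** 2 + (y2[m] - x2[n]) ** 2)
            if d <= tresh:
                a.append((y1[m] + x1[n]) / 2)
                b.append((y2[m] + x2[n]) / 2)
                c.append(d)
    return a, b, c


def calc_threaded(x1, x2, y1, y2, tresh, workers=4):
    calc_slice(x1, x2, y1, y2, tresh, 0, 0)  # warmup: compile before starting threads

    # Split the outer loop into contiguous slices, one per thread.
    bounds = np.linspace(0, len(x1), workers + 1).astype(np.int64)
    a, b, c = [], [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(calc_slice, x1, x2, y1, y2, tresh, int(lo), int(hi))
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ]
        # Collecting in submission order keeps the original loop order.
        for future in futures:
            _a, _b, _c = future.result()
            a.extend(_a)
            b.extend(_b)
            c.extend(_c)
    return a, b, c
```

You would call it like the single-threaded version, for example `calc_threaded(x1, x2, y1, y2, tresh, workers=4)` with the NumPy arrays from above.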

Answered By: Ahmed AEK

If you can’t use numba (for example, because you’re on a brand-new Python version that numba doesn’t support yet), then you could consider multiprocessing as follows:

from math import sqrt
from random import randint
from timer import timer  # timer.py is a small local module, see the notes below
from concurrent.futures import ProcessPoolExecutor as PPE

from_to = 0, 100
tresh = 30.0
N = 10_000

x1 = [randint(*from_to) for _ in range(N)]
x2 = [randint(*from_to) for _ in range(N)]
y1 = [randint(*from_to) for _ in range(N)]
y2 = [randint(*from_to) for _ in range(N)]

def subcalc(_x1, _x2, y1, y2):
    a, b, c = [], [], []
    for _y1, _y2 in zip(y1, y2):
        j = abs(_y1 - _x1) ** 2
        k = abs(_y2 - _x2) ** 2
        if (d := sqrt(j + k)) <= tresh:
            a.append((_y1 + _x1) / 2)
            b.append((_y2 + _x2) / 2)
            c.append(d)
    return a, b, c

@timer
def calc():
    a, b, c = [], [], []
    futures = []
    with PPE(9) as executor:
        for _x1, _x2 in zip(x1, x2):
            futures.append(executor.submit(subcalc, _x1, _x2, y1, y2))
        for future in futures:
            _a, _b, _c = future.result()
            a.extend(_a)
            b.extend(_b)
            c.extend(_c)
    return a, b, c

if __name__ == '__main__':
    calc()

Notes:

This is how the timer decorator is implemented:

from functools import wraps
from timeit import default_timer

__all__ = ['timer']

def timer(func):
    @wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = default_timer()
        result = func(*args, **kwargs)
        print(func.__name__, f'Duration = {default_timer()-start:.6f}')
        return result
    return wrapper

Also note the max_workers value for the process pool executor. My machine has a 10-core Intel Xeon, so I use 9 workers, leaving one core spare for the main program (and anything else that happens to be running).
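If you don't want to hard-code the worker count, something like the following could derive it from the machine. `pick_workers` and its `reserve` parameter are hypothetical names for this sketch, not part of any library.

```python
import os


def pick_workers(reserve=1):
    """Leave `reserve` cores free, but always use at least one worker."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, cores - reserve)
```

You could then write `PPE(pick_workers())` instead of `PPE(9)`.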

With N == 10_000 this runs in 8.2 s.
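One possible refinement, which I have not timed: the code above submits one task per x point, and every task sends y1 and y2 to a worker process again. Handing each worker a block of x points cuts the number of tasks and transfers. In this sketch, `subcalc_block`, `calc_blocks` and the `block` size are assumptions of mine, and tresh is passed as an argument instead of being read from a global.

```python
from math import sqrt
from concurrent.futures import ProcessPoolExecutor


def subcalc_block(block_x1, block_x2, y1, y2, tresh):
    """Compare one block of x points against all y points in a single task."""
    a, b, c = [], [], []
    for _x1, _x2 in zip(block_x1, block_x2):
        for _y1, _y2 in zip(y1, y2):
            d = sqrt((_y1 - _x1) ** 2 + (_y2 - _x2) ** 2)
            if d <= tresh:
                a.append((_y1 + _x1) / 2)
                b.append((_y2 + _x2) / 2)
                c.append(d)
    return a, b, c


def calc_blocks(x1, x2, y1, y2, tresh, workers=4, block=500):
    a, b, c = [], [], []
    with ProcessPoolExecutor(workers) as executor:
        # One task per block of x points instead of one task per point.
        futures = [
            executor.submit(subcalc_block, x1[i:i + block], x2[i:i + block], y1, y2, tresh)
            for i in range(0, len(x1), block)
        ]
        # Collecting in submission order keeps the original loop order.
        for future in futures:
            _a, _b, _c = future.result()
            a.extend(_a)
            b.extend(_b)
            c.extend(_c)
    return a, b, c
```

As with the code above, call `calc_blocks(...)` from under `if __name__ == '__main__':` so that worker processes can import the module safely.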

Answered By: Pingu