Faster Startup of Processes in Python

Question:

I’m trying to run two functions in parallel in Python 3. Each takes about 30 ms, but a test script shows that getting the processes running in the background takes over 100 ms, which is a high overhead that I’d like to avoid. Is anybody aware of a faster way to run functions concurrently in Python 3, with lower overhead (ideally in the ones or tens of milliseconds), where I can still get the functions’ results in the main process? Any guidance would be appreciated, and if there is any information I can provide, please let me know.

For hardware information, I’m running this on a 2019 MacBook Pro with Python 3.10.9 with a 2GHz Quad-Core Intel Core i5.

I’ve provided the script that I’ve written below as well as the output that I typically get from it.

import multiprocessing as mp
import time
import numpy as np

def t(s):
    return (time.perf_counter() - s) * 1000

def run0(s):
    print(f"Time to reach run0: {t(s):.2f}ms")

    time.sleep(0.03)
    return np.ones((1,4))

def run1(s):
    print(f"Time to reach run1: {t(s):.2f}ms")

    time.sleep(0.03)
    return np.zeros((1,5))

def main():
    s = time.perf_counter()

    with mp.Pool(processes=2) as p:
        print(f"Time to init pool: {t(s):.2f}ms")

        f0 = p.apply_async(run0, args=(time.perf_counter(),))
        f1 = p.apply_async(run1, args=(time.perf_counter(),))

        r0 = f0.get()
        r1 = f1.get()
        print(r0, r1)

    print(f"Time to run end-to-end: {t(s):.2f}ms")

if __name__ == "__main__":
    main()

Below is the output that I typically get from running the above script:

Time to init pool: 33.14ms
Time to reach run0: 198.50ms
Time to reach run1: 212.06ms
[[1. 1. 1. 1.]] [[0. 0. 0. 0. 0.]]
Time to run end-to-end: 287.68ms

Note: I’m looking to reduce the times on the 2nd and 3rd lines by a factor of 10-20. I know that is a lot, and if it is not possible, that is perfectly fine, but I was wondering whether anybody more knowledgeable knows of any methods. Thanks!

Asked By: dinodeep


Answers:

You can switch to Python 3.11+, which has a faster startup time (and is faster in general), but as your application grows, startup will get even slower than in your toy example.

One option is to run your application inside a Linux Docker image so that you can use fork and avoid the spawn overhead (though the copy-on-write (COW) overhead will still be visible).

The ultimate solution? Don’t write your application in Python (or any other language with a VM or a garbage collector). Python multiprocessing isn’t made for small, fast tasks but for long-running ones; if you need that low a startup time, write it in C or C++.

If you have to use Python, then you should reuse your workers to "absorb" this startup time into a much larger total task time.
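
A minimal sketch of the worker-reuse idea, assuming the pool is created once and kept open for the life of the program. The names work and timed_batch are hypothetical stand-ins, and timings depend on the machine, so none are shown. The first batch absorbs the process startup cost; later batches should take roughly the task time plus a small dispatch overhead.

```python
import multiprocessing as mp
import time


def work(x):
    # Hypothetical stand-in for a real ~30 ms task.
    time.sleep(0.03)
    return x * 2


def timed_batch(pool, items):
    # Run one batch on an already-created pool; return (results, elapsed ms).
    start = time.perf_counter()
    results = pool.map(work, items, chunksize=1)
    return results, (time.perf_counter() - start) * 1000


def main():
    # Create the pool once and keep it open for every batch.
    with mp.Pool(processes=2) as pool:
        _, first_ms = timed_batch(pool, [0, 1])        # includes worker startup
        results, later_ms = timed_batch(pool, [0, 1])  # workers already running
    print(f"first batch: {first_ms:.2f}ms, later batch: {later_ms:.2f}ms")
    return results


if __name__ == "__main__":
    main()
```

In a real application the pool would typically live at module or object scope and be closed only on shutdown, so that only the very first batch pays the startup cost.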

Answered By: Ahmed AEK

Several points to consider:

  • "Time to init pool" is misleading. At that point the child processes haven’t finished starting; the main process has only initiated their startup. Once the workers have actually started, "Time to reach run" should drop because it no longer includes process startup. With a long-lived pool of workers, you pay the startup cost only once.

  • Interpreter startup cost is often dominated by imports. In this case you really only have numpy, and it is used by the target function, so you can’t exactly get rid of it. Another import that can be slow is the automatic import of site, but skipping that one makes other imports difficult.

  • You’re on macOS, where you can switch to "fork" instead of "spawn". That should be much faster, but it fundamentally changes how multiprocessing works in a few ways (and is incompatible with certain OS libraries).

Example:

import multiprocessing as mp
import time
# import numpy as np  # uncomment to measure the cost of importing numpy in the worker

def run():
    time.sleep(0.03)
    return "whatever"

def main():
    s = time.perf_counter()
    with mp.Pool(processes=1) as p:

        p.apply_async(run).get()
        print(f"first job time: {(time.perf_counter() - s) * 1000:.2f}ms")
        # first job: 166ms with numpy; 85ms without; 45ms on Linux (WSL2, Ubuntu 20.04) with fork
        s = time.perf_counter()

        p.apply_async(run).get()
        print(f"after startup job time: {(time.perf_counter() - s) * 1000:.2f}ms")
        # second job: about 30ms every time

if __name__ == "__main__":
    main()
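
A sketch of how the start methods could be compared on one machine, assuming multiprocessing.get_context and multiprocessing.get_all_start_methods are available (Python 3.4+). The helper first_job_ms is hypothetical. "fork" is not offered on Windows, and on macOS it can crash with some system libraries, so treat this as an experiment and not a recommendation. Timings vary by machine and are not shown.

```python
import multiprocessing as mp
import time


def run():
    time.sleep(0.03)
    return "whatever"


def first_job_ms(start_method):
    # Build a fresh one-worker pool with the given start method and time
    # pool creation plus the first job, which includes process startup.
    ctx = mp.get_context(start_method)
    s = time.perf_counter()
    with ctx.Pool(processes=1) as p:
        result = p.apply_async(run).get()
    return result, (time.perf_counter() - s) * 1000


def main():
    # Only try the start methods this platform actually supports.
    for method in mp.get_all_start_methods():
        _, ms = first_job_ms(method)
        print(f"{method}: first job took {ms:.2f}ms")


if __name__ == "__main__":
    main()
```

Using a context object keeps the choice local to this pool, whereas mp.set_start_method changes the default for the whole program and can be called at most once.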
Answered By: Aaron