Python: start ThreadPoolExecutors in multiple processes for a performance increase

Question:

I am just doing simple I/O tasks and want to improve the performance of my program. Using 1000 threads (which is important because I want to run a high number of tasks at the same time, and a multiprocessing pool obviously isn't doing the job: with only 8 cores I can only run 8 tasks at once) just takes too long to start them up; the CLI seems to freeze, and after 2-3 minutes the tasks finally start. So I want to spread them across the cores in multiple processes to utilize more of the power of my machine.

So my current code looks like this (the real runTask method is way more complex, not just a print, and the profileTasks list has more data than just a single string in it):

from concurrent.futures import ThreadPoolExecutor, as_completed

class ThreadingxMultiprocessing():
    
    def __init__(self) -> None:
        
        profileTasks = ["TEST1",
                        "TEST2",
                        "TEST3",
                        "TEST4",
                        "TEST5",
                        "TEST6",
                        "TEST7",
                        "TEST8",
                        "TEST9",
                        "TEST10",
                        "TEST11",
                        "TEST12",
                        "TEST13",
                        "TEST14",
                        "TEST15",
                        "TEST16",
                        "TEST17",
                        "TEST18",
                        "TEST19",
                        "TEST20",
                        "TEST21",
                        "TEST22",
                        "TEST23",
                        "TEST24",
                        "... and some more to get to 1k profiles",]
        
        self.threads=1000
        
        while True:
                        
            with ThreadPoolExecutor(max_workers=self.threads) as executor:
                for index, profile in enumerate(profileTasks):
                    
                    executor.submit(
                        self.runTask, index, profile
                    )

            
            break
     
    def runTask(self, index, profile): 
        print(index,profile)

ThreadingxMultiprocessing()

I thought about something like this: dividing the thread count by the number of CPU cores and spreading the threads equally across them:

from concurrent.futures import ThreadPoolExecutor, as_completed
import multiprocessing
import math
number_of_cpucores = multiprocessing.cpu_count()

class ThreadingxMultiprocessing():
    
    def __init__(self) -> None:
        
        profileTasks = ["TEST1",
                        "TEST2",
                        "TEST3",
                        "TEST4",
                        "TEST5",
                        "TEST6",
                        "TEST7",
                        "TEST8",
                        "TEST9",
                        "TEST10",
                        "TEST11",
                        "TEST12",
                        "TEST13",
                        "TEST14",
                        "TEST15",
                        "TEST16",
                        "TEST17",
                        "TEST18",
                        "TEST19",
                        "TEST20",
                        "TEST21",
                        "TEST22",
                        "TEST23",
                        "TEST24",
                        "... and some more to get to 1k profiles"]
        
        self.threads=1000
        # round up so each process gets an integer number of threads
        threads_in_each_process = math.ceil(self.threads / number_of_cpucores)
        
        # -> and then start the thread pools, e.g. with 125 threads each if you have 8 cores
        multiprocessing.Process()
        

    def runTask(self, index, profile): 
        print(index,profile)

ThreadingxMultiprocessing()

But I really don't know how to set this up. Does anyone have an idea?

Asked By: realsuspection


Answers:

You can create a function that each process runs, which builds the thread pool and submits work to it; you just need to split the work into equal parts using a helper function.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def runTask(index, profile):
    print(index, profile)


def process_function(work_and_workers):
    # runs inside each worker process: build a thread pool and drain its chunk
    work_list, workers = work_and_workers
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for element in work_list:
            futures.append(pool.submit(runTask, *element))
        for future in futures:
            future.result()  # propagate any exception raised in a thread

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

class ThreadingxMultiprocessing():

    def __init__(self) -> None:

        profileTasks = [f"Test{x}" for x in range(1000)]

        self.cores = 8
        self.threads_per_worker = 125
        self.chunk_size = 125
        # pair each profile with its index, then hand one chunk to each process
        work_to_do = list(enumerate(profileTasks))
        with ProcessPoolExecutor(max_workers=self.cores) as executor:
            executor.map(process_function,
                         ((x, self.threads_per_worker) for x in chunks(work_to_do, self.chunk_size)))


if __name__ == "__main__":
    ThreadingxMultiprocessing()
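
With 1000 tasks and a chunk_size of 125, chunks() yields exactly 8 chunks, so each of the 8 processes receives one chunk and runs it on its own 125-thread pool.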

This is the worst code anyone can write to parallelize work blindly: even if the work were exactly equal-sized, you would get much poorer performance than a simple process pool with 16 workers, depending on the work.
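
For comparison, a plain process pool along those lines might look like this minimal sketch (the 16 workers are just the figure mentioned above; tune it to your machine):

from concurrent.futures import ProcessPoolExecutor


def runTask(index, profile):
    print(index, profile)


if __name__ == "__main__":
    profileTasks = [f"Test{x}" for x in range(1000)]
    # one flat pool; the executor balances tasks across workers by itself
    with ProcessPoolExecutor(max_workers=16) as executor:
        # consuming the iterator also re-raises any exception from a task
        list(executor.map(runTask, range(len(profileTasks)), profileTasks))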

The two biggest problems here are balancing the work across processes and sending results back to the main process. A queue is useful for the latter, but balancing the work is hard to get right on most systems, because this "equal-sized work" isn't going to be distributed equally across your physical cores.
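
As a rough sketch of the queue idea (the worker function, the 4-way split, and the placeholder task are illustrative assumptions, not part of the original code), each process pushes its results onto a multiprocessing.Queue and the main process drains it:

import multiprocessing


def worker(task_chunk, result_queue):
    for index, profile in task_chunk:
        # placeholder for the real task: push (index, result) back
        result_queue.put((index, profile.lower()))


if __name__ == "__main__":
    tasks = list(enumerate(f"Test{x}" for x in range(1000)))
    result_queue = multiprocessing.Queue()
    chunk = len(tasks) // 4  # 4 processes, 250 tasks each (illustrative split)
    processes = [
        multiprocessing.Process(target=worker, args=(tasks[i:i + chunk], result_queue))
        for i in range(0, len(tasks), chunk)
    ]
    for p in processes:
        p.start()
    # drain the queue before joining, so the queue's feeder threads can flush
    results = [result_queue.get() for _ in range(len(tasks))]
    for p in processes:
        p.join()
    print(len(results), "results collected")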

Even if the work were balanced and results were returned to the main process, running 1000 threads on an 8-core machine will be slow because of the constant context switching. Also, I/O usually isn't made to handle 1000 concurrent hits and might crash or slow down. So while this "parallelizes" the work, it's the same as getting 1000 people to bake a small cake for a kid … it's not going to be pretty.

You should probably look into other parallelization mechanisms such as asyncio, or reduce the thread count, as throwing "parallel" at the problem can make your code slower, not faster.
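
For illustration, a minimal asyncio sketch under the assumption that the real task can be made awaitable (the asyncio.sleep and the semaphore limit of 100 are placeholders); the semaphore caps how many tasks hit the I/O at once:

import asyncio


async def run_task(index, profile, semaphore):
    # the semaphore limits concurrent I/O to 100 instead of 1000
    async with semaphore:
        await asyncio.sleep(0.1)  # stand-in for the real awaitable I/O call
        print(index, profile)


async def main():
    profile_tasks = [f"Test{x}" for x in range(1000)]
    semaphore = asyncio.Semaphore(100)
    await asyncio.gather(
        *(run_task(i, p, semaphore) for i, p in enumerate(profile_tasks))
    )


if __name__ == "__main__":
    asyncio.run(main())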

Answered By: Ahmed AEK