Parallelizing a CPU-bound Python function

Question:

I have a CPU-bound Python function that takes around 15 seconds to run on a standard core. I need to run this function tens of thousands of times. The function input is a dataset around 10kB in size, so data transfer time should be negligible compared to the runtime. The functions do not need to communicate with each other. The return value is a small array.

I do not need to synchronize these functions at all. All I care about is that when one core finishes, it gets delegated a new job.

What is a good framework to start parallelizing this problem with? I would like to be able to run this on my own computers as well as on Amazon EC2 instances.

Would Python’s multiprocessing module do the trick? Would I be better off with something other than that?

Asked By: Cory Walker


Answers:

If no communication is needed, the simplest approach is Pool.map. It works like the built-in map function, but each item is processed in one of the pool's child processes.

import multiprocessing

def fu(chunk):
    # your CPU-bound code here: compute and return the result for one chunk
    return result

def produce_data(data):
    # split your data into independent chunks and yield them one at a time
    for chunk in data:
        yield chunk

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    result = pool.map(fu, produce_data(data))
    # result is an ordered list with one entry per chunk
    pool.close()
    pool.join()

There are several other ways to process data with multiprocessing.
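For example, since the jobs are independent and you only care that an idle core immediately gets a new job, Pool.imap_unordered may fit even better than Pool.map: it hands tasks to workers as they become free and yields each result as soon as it finishes, regardless of order. A minimal sketch, where heavy_task is a placeholder standing in for the real 15-second computation and jobs is a made-up list of input datasets:

import multiprocessing

def heavy_task(dataset):
    # placeholder for the real CPU-bound function (~15 s per call)
    return [sum(x * x for x in dataset)]

if __name__ == '__main__':
    # stand-in for the tens of thousands of ~10 kB input datasets
    jobs = [list(range(1000)) for _ in range(100)]

    with multiprocessing.Pool() as pool:  # defaults to one worker per core
        # imap_unordered hands out a new task whenever a worker frees up and
        # yields each small result array as soon as it is ready
        for result in pool.imap_unordered(heavy_task, jobs):
            print(result)

The same code runs unchanged on a local machine or on a single Amazon EC2 instance; spreading the jobs across multiple machines would need a separate distribution layer on top.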

Answered By: eri