Parallelizing a CPU-bound Python function
Question:
I have a CPU-bound Python function that takes around 15 seconds to run on a standard core. I need to run this function tens of thousands of times. The function input is a dataset around 10kB in size, so data transfer time should be negligible compared to the runtime. The functions do not need to communicate with each other. The return value is a small array.
I do not need to synchronize these functions at all. All I care about is that when one core finishes, it gets delegated a new job.
What is a good framework to start parallelizing this problem with? I would like to be able to run this on my own computers and also Amazon units.
Would Python’s multiprocessing module do the trick? Would I be better off with something other than that?
Answers:
If no communication is needed, the simplest approach is Pool.map. It works like the built-in map function, but each item is processed in one of the child processes.
import multiprocessing

def fu(chunk):
    # your code here
    return result

def produce_data(data):
    while data:
        # you need to split data into chunks here
        yield chunk

if __name__ == "__main__":
    # Create the pool after the worker function is defined, so child
    # processes can import it (required on platforms that use spawn).
    pool = multiprocessing.Pool(processes=4)
    result = pool.map(fu, produce_data(data))
    # result will be an ordered list of results, one per chunk
There are several other ways to process data with multiprocessing as well.
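Since the asker only requires that a core pick up a new job as soon as it finishes the previous one, Pool.imap_unordered with chunksize=1 is worth knowing about: it hands out work item by item and yields results in completion order, which keeps all cores busy even when job runtimes vary. Below is a minimal sketch; the simulate function and the sample datasets are illustrative stand-ins, not part of the original question.

```python
import multiprocessing

def simulate(dataset):
    # Stand-in for the real 15-second CPU-bound function;
    # here it just sums the input so the example runs quickly.
    return sum(dataset)

if __name__ == "__main__":
    # Toy inputs standing in for the ~10 kB datasets.
    datasets = [[i, i + 1, i + 2] for i in range(10)]
    with multiprocessing.Pool(processes=4) as pool:
        # imap_unordered gives each worker a new dataset the moment it
        # finishes its current one; results arrive in completion order
        # rather than input order.
        for res in pool.imap_unordered(simulate, datasets, chunksize=1):
            print(res)
```

If the original input order of the results matters, pool.map (as above) or pool.imap preserves it at the cost of slightly less even load balancing.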