Replace pickle in Python multiprocessing lib
Question:
I need to execute the code below (simplified version of my real code base in Python 3.5):
import multiprocessing

def forever(do_something=None):
    while True:
        do_something()

p = multiprocessing.Process(target=forever, args=(lambda: print("do something"),))
p.start()
In order to create the new process, Python needs to pickle the target function and the lambda passed as its argument.
Unfortunately, pickle cannot serialize lambdas, and the output looks like this:
_pickle.PicklingError: Can't pickle <function <lambda> at 0x00C0D4B0>: attribute lookup <lambda> on __main__ failed
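The failure can be reproduced without starting a process at all, since the same pickling step is what fails. The snippet below is only a minimal sketch; the helper name try_pickle is made up for illustration, and the exact exception type and message can vary between Python versions.

```python
import pickle

def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the exception raised.

    try_pickle is a hypothetical helper, not part of any library.
    """
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError) as exc:
        return exc
    return None

# A builtin is pickled by reference to its importable name, so this works.
print(try_pickle(len))

# A lambda has no importable name, so the attribute lookup fails.
print(try_pickle(lambda: "do something"))
```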
I discovered cloudpickle, which can serialize and deserialize lambdas and closures using the same interface as pickle.
How can I force the Python multiprocessing module to use cloudpickle instead of pickle?
Clearly hacking the code of the standard lib multiprocessing is not an option!
Thanks
Charlie
Answers:
Try multiprocess. It’s a fork of multiprocessing that uses the dill serializer instead of pickle; there are no other changes in the fork.
I’m the author. I encountered the same problem as you several years ago, and ultimately I decided that hacking the standard library was my only choice, as some of the pickle code in multiprocessing is in C.
>>> import multiprocess as mp
>>> p = mp.Pool()
>>> p.map(lambda x:x**2, range(4))
[0, 1, 4, 9]
>>>
If you’re willing to do a little monkeypatching, a quick fix is to swap out pickle.Pickler:
import pickle
import cloudpickle
pickle.Pickler = cloudpickle.Pickler
or, in more recent versions of Python where _pickle.Pickler is pulled in:
from multiprocessing import reduction
import cloudpickle
reduction.ForkingPickler = cloudpickle.Pickler
Just make sure to do this before importing multiprocessing. Here’s a full example:
import pickle
import cloudpickle
pickle.Pickler = cloudpickle.Pickler

import multiprocessing as mp
mp.set_start_method('spawn', True)

def procprint(f):
    print(f())

if __name__ == '__main__':
    p = mp.Process(target=procprint, args=(lambda: "hello",))
    p.start()
    p.join()
As an aside, you won’t need to do any of this if your start method is fork, since with forking nothing needs to be pickled in the first place.
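To illustrate the aside about fork, here is a minimal sketch that passes a lambda to a child process using only the standard library. It assumes a POSIX system where the fork start method is available (it is not on Windows). The names call_in_child and run_with_fork are made up for this example.

```python
import multiprocessing as mp

def call_in_child(fn, queue):
    # Runs in the child: fn was inherited through fork, not unpickled.
    queue.put(fn())

def run_with_fork(fn):
    """Call fn in a forked child and return its result (hypothetical helper)."""
    ctx = mp.get_context("fork")  # assumption: POSIX only
    queue = ctx.Queue()
    child = ctx.Process(target=call_in_child, args=(fn, queue))
    child.start()
    # Only the return value travels through the queue, so it must be picklable.
    result = queue.get(timeout=10)
    child.join()
    return result

if __name__ == "__main__":
    print(run_with_fork(lambda: "hello"))
```

Note that the result still goes back through a queue, so the value the lambda returns has to be picklable even though the lambda itself does not.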
I ran into the same problem, so I wrote a small module that lets Python’s multiprocessing accept lambdas.
If you have a lot of different unpicklable things, I would also recommend using dill or cloudpickle.
https://github.com/cloasdata/lambdser
pip install lambdser
I had a similar problem: I had to send data to the workers that can be cloudpickled but not pickled with the standard pickle module.
But I wanted multiprocessing itself to keep working with the normal pickle module for various reasons, so I used this pattern:
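import cloudpickle

class FunctionWrapper:
    def __init__(self, fn):
        # Serialize eagerly with cloudpickle; only the resulting bytes are stored.
        self.fn_ser = cloudpickle.dumps(fn)

    def __call__(self):
        # Rebuild the function in the worker process and call it.
        fn = cloudpickle.loads(self.fn_ser)
        return fn()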
Then you can pass your lambda, or whatever is causing the problem, like this (note that args must be a tuple):
p = multiprocessing.Process(target=forever, args=(FunctionWrapper(lambda: print("do something")),))
The point is that the ‘meaningful’ serialization happens outside the multiprocessing module, with whatever library you want. The pickle in multiprocessing only sees a plain object with a bytes attribute.
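The same idea can be tried without installing anything. The sketch below swaps cloudpickle for the standard library's marshal module, which is a much weaker substitute: it only handles functions without closures, drops default arguments, and its format is tied to the Python version. CodeWrapper is a hypothetical name. The round trip through pickle.dumps and pickle.loads stands in for what multiprocessing does when it ships arguments to a worker.

```python
import builtins
import marshal
import pickle
import types

class CodeWrapper:
    """Hypothetical stand-in for FunctionWrapper, using stdlib marshal."""

    def __init__(self, fn):
        # Assumption of this sketch: fn has no closure variables.
        if fn.__closure__:
            raise ValueError("closures are not supported by this sketch")
        # Store only bytes, which the standard pickle handles without trouble.
        self.code_bytes = marshal.dumps(fn.__code__)

    def __call__(self, *args):
        # Rebuild a function object from the stored code and call it.
        code = marshal.loads(self.code_bytes)
        fn = types.FunctionType(code, {"__builtins__": builtins})
        return fn(*args)

if __name__ == "__main__":
    wrapped = CodeWrapper(lambda x: x * 2)
    # Standard pickle sees a plain object holding bytes, not a lambda.
    clone = pickle.loads(pickle.dumps(wrapped))
    print(clone(21))
```

For anything beyond simple lambdas, cloudpickle or dill remains the better choice inside the wrapper.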