Why can't I use multiprocessing.Queue with ProcessPoolExecutor?

Question:

When I run the code below:

from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

q = Queue()

def my_task(x, queue):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i, q) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())

I get this error:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
    context.assert_spawning(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
    raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/nn.py", line 14, in <module>
    print(task.result())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 58, in __getstate__
    context.assert_spawning(self)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 373, in assert_spawning
    raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance

What is the purpose of multiprocessing.Queue if I cannot use it for multiprocessing? How can I make this work? In my real code, I need every worker to update a queue frequently with the task status so that another thread can read from that queue to feed a progress bar.

Asked By: chrislamp


Answers:

Short Explanation

Why can’t you pass a multiprocessing.Queue as a worker function argument? The short answer is that submitted tasks and their arguments are placed on a hidden input queue from which the pool processes get the next task to perform. Those arguments must be serializable with pickle, and a multiprocessing.Queue is not serializable in general. It is serializable only in the special case of being passed to a child process as a function argument: the arguments to a multiprocessing.Process are stored as an attribute of the instance when it is created, and when start is called on the instance, that state is serialized into the new address space before the run method is called there. Why this serialization works for that case but not the general case is unclear to me; I would have to spend a lot of time looking at the interpreter source to come up with a definitive answer.

See what happens when I try to put a queue instance onto another queue:

>>> from multiprocessing import Queue
>>> q1 = Queue()
>>> q2 = Queue()
>>> q1.put(q2)
>>> Traceback (most recent call last):
  File "C:Program FilesPython38libmultiprocessingqueues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "C:Program FilesPython38libmultiprocessingreduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "C:Program FilesPython38libmultiprocessingqueues.py", line 58, in __getstate__
    context.assert_spawning(self)
  File "C:Program FilesPython38libmultiprocessingcontext.py", line 359, in assert_spawning
    raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance

>>> import pickle
>>> b = pickle.dumps(q2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:Program FilesPython38libmultiprocessingqueues.py", line 58, in __getstate__
    context.assert_spawning(self)
  File "C:Program FilesPython38libmultiprocessingcontext.py", line 359, in assert_spawning
    raise RuntimeError(
RuntimeError: Queue objects should only be shared between processes through inheritance
>>>
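
For contrast, here is a minimal sketch (not from the original question) of the special case that does work: passing the queue as an argument when the child process is created explicitly with multiprocessing.Process, so the queue is handed over as part of creating the child rather than being pickled onto the pool's internal task queue.

from multiprocessing import Process, Queue

def worker(x, queue):
    # The queue arrives via Process(..., args=...), i.e. through inheritance.
    queue.put(f"Task {x} Complete")

if __name__ == '__main__':
    q = Queue()
    processes = [Process(target=worker, args=(i, q)) for i in range(3)]
    for p in processes:
        p.start()
    # Read the messages before joining so the queue's feeder threads can drain:
    for _ in processes:
        print(q.get())
    for p in processes:
        p.join()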

How to Pass the Queue via Inheritance

First of all, your code will run more slowly using multiprocessing than if you had just called my_task in a loop, because multiprocessing introduces additional overhead (starting processes and moving data across address spaces). For multiprocessing to pay off, the gain from running my_task in parallel must more than offset that overhead, and in your case it doesn’t, because my_task is not sufficiently CPU-intensive to justify multiprocessing.
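
As a rough illustration of that overhead claim, here is a small timing sketch (not part of the original answer; absolute numbers will vary by machine, only the relative difference matters) comparing a trivial task run in a plain loop versus through a process pool:

import time
from concurrent.futures import ProcessPoolExecutor

def trivial(x):
    return x

if __name__ == '__main__':
    start = time.perf_counter()
    loop_results = [trivial(i) for i in range(10)]
    print('plain loop:  ', time.perf_counter() - start)

    start = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        pool_results = list(executor.map(trivial, range(10)))
    print('process pool:', time.perf_counter() - start)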

That said, when you wish to have your pool processes use a multiprocessing.Queue instance, it cannot be passed as an argument to a worker function (unlike the case where you are explicitly using multiprocessing.Process instances instead of a pool). Instead, you must initialize a global variable in each pool process with the queue instance.

If you are running on a platform that uses fork to create new processes, then you can simply create the queue as a global and it will be inherited by each pool process:

from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

queue = Queue()

def my_task(x):
    queue.put("Task Complete")
    return x

with ProcessPoolExecutor() as executor:
    tasks = [executor.submit(my_task, i) for i in range(10)]
    for task in as_completed(tasks):
        print(task.result())
    # This queue must be read before the pool terminates:
    for _ in range(10):
        print(queue.get())

Prints:

1
0
2
3
6
5
4
7
8
9
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete
Task Complete

If you need portability with platforms that do not use the fork method to create processes, such as Windows (which uses the spawn method), then you cannot allocate the queue as a global, since each pool process would create its own queue instance. Instead, the main process must create the queue and then initialize each pool process's global queue variable by using the initializer and initargs arguments:

from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import Queue

def init_pool_processes(q):
    global queue

    queue = q

def my_task(x):
    queue.put("Task Complete")
    return x

# Windows compatibility
if __name__ == '__main__':
    q = Queue()

    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(10)]
        for task in as_completed(tasks):
            print(task.result())
        # This queue must be read before the pool terminates:
        for _ in range(10):
            print(q.get())

If you want to advance a progress bar as each task completes (you haven’t precisely stated how the bar is to advance; see my comment to your question), then the following shows that a queue is not necessary. If, however, each submitted task consisted of N parts (for a total of 10 * N parts, since there are 10 tasks) and you would like to see a single progress bar advance as each part is completed, then a queue is probably the most straightforward way of signaling a part completion back to the main process (see the sketch after the code below).

from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def my_task(x):
    return x

# Windows compatibility
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        with tqdm(total=10) as bar:
            tasks = [executor.submit(my_task, i) for i in range(10)]
            for _ in as_completed(tasks):
                bar.update()
            # To get the results in task submission order:
            results = [task.result() for task in tasks]
    print(results)
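
For the per-part case, a sketch along these lines (names such as N_PARTS and progress_reader are illustrative, not from the original answer) combines the initializer-passed queue from above with a reader thread in the main process that feeds a single tqdm bar, which is essentially what you described wanting in your real code:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Queue
from threading import Thread
from tqdm import tqdm

N_TASKS = 10
N_PARTS = 5  # hypothetical number of parts per task

def init_pool_processes(q):
    global queue

    queue = q

def my_task(x):
    for _ in range(N_PARTS):
        ...  # do one part of the work here
        queue.put(1)  # signal that one part has completed
    return x

def progress_reader(q, total):
    # Runs in a thread of the main process and advances the bar
    # once per message received from the workers:
    with tqdm(total=total) as bar:
        for _ in range(total):
            q.get()
            bar.update()

# Windows compatibility
if __name__ == '__main__':
    q = Queue()
    reader = Thread(target=progress_reader, args=(q, N_TASKS * N_PARTS))
    reader.start()
    with ProcessPoolExecutor(initializer=init_pool_processes, initargs=(q,)) as executor:
        tasks = [executor.submit(my_task, i) for i in range(N_TASKS)]
        # Results in task submission order:
        results = [task.result() for task in tasks]
    reader.join()
    print(results)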
Answered By: Booboo