ProcessPoolExecutor does not mutate instance variable when submitting instance method

Question:

Given an instance method that mutates an instance variable, running that method in a ProcessPoolExecutor does execute the method, but does not mutate the instance variable in the parent process.

from concurrent.futures import ProcessPoolExecutor


class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        self.started = True


if __name__ == "__main__":
    a = A()

    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)

    assert a.started
Output:

Started...
Traceback (most recent call last):
  File "/path/to/file", line 19, in <module>
    assert a.started
AssertionError

Are only pure functions allowed in ProcessPoolExecutor?

Asked By: Keto


Answers:

For Windows

Multiprocessing does not share its state with child processes on Windows, because the default way to start child processes on Windows is spawn. From the documentation for the spawn start method:

The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.

Therefore, any objects you pass to child processes are actually copied, and do not have the same memory address as in the parent process. A simple way to demonstrate this with your example is to print the object in both the child and the parent process:

from concurrent.futures import ProcessPoolExecutor


class A:
    def __init__(self):
        self.started = False

    def method(self):
        print("Started...")
        print(f'Child proc: {self}')
        self.started = True


if __name__ == "__main__":
    a = A()
    print(f'Parent proc: {a}')
    with ProcessPoolExecutor() as executor:
        executor.submit(a.method)

Output

Parent proc: <__main__.A object at 0x0000028F44B40FD0>
Started...
Child proc: <__mp_main__.A object at 0x0000019D2B8E64C0>

As you can see, the two objects reside at different memory addresses. Altering one does not affect the other in any way. This is why you don’t see any change to a.started in the parent process.

Once you understand this, your question becomes how to share the same object, rather than a copy, with the child processes. There are numerous ways to go about this, and questions on how to share complex objects like a have already been asked and answered on Stack Overflow.
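One common approach is to keep the shared state in a manager process and pass a proxy object to the workers. The sketch below uses multiprocessing.Manager for this; the function names start and run_demo are illustrative, not part of any library:

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager


def start(shared):
    # Runs in a worker process; writes go through the proxy to the
    # manager process, so the parent sees the change.
    shared["started"] = True


def run_demo():
    with Manager() as manager:
        shared = manager.dict(started=False)
        with ProcessPoolExecutor() as executor:
            executor.submit(start, shared).result()
        return shared["started"]


if __name__ == "__main__":
    assert run_demo()  # the mutation is now visible in the parent
```

Proxy access is slower than plain attribute access, so this is best reserved for state that genuinely must be shared.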

For UNIX

The same can be said for the other start methods available on UNIX-based systems (I am not sure what the default for concurrent.futures is on macOS). For example, the multiprocessing documentation describes fork as follows:

The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.

So fork creates child processes that share the entire memory space of the parent process at start. However, it uses copy-on-write to do so. This means that if the child process modifies a shared object, the affected memory is duplicated first, so the parent process is not touched and the modified object becomes local to the child (much like what spawn does at start).
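This isolation is easy to observe directly. The sketch below (class and function names are illustrative) shows that a mutation made inside the child stays invisible to the parent, whichever start method is in use:

```python
import multiprocessing


class Flag:
    def __init__(self):
        self.set = False

    def raise_flag(self):
        self.set = True  # mutates only the child's copy of the object


def run_demo():
    flag = Flag()
    proc = multiprocessing.Process(target=flag.raise_flag)
    proc.start()
    proc.join()
    return flag.set  # still False in the parent


if __name__ == "__main__":
    assert run_demo() is False
```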

Hence the answer still stands: if you plan to modify the objects passed to the child process, or if you are not on a UNIX system, you will need to share the objects yourself so that both processes point to the same memory.

Further reading on start methods.

Answered By: Charchit Agarwal

Adding to what @Charchit explained –

Starting from Python 3.8, the default start method on macOS is spawn. However, this can be controlled by passing a context to the ProcessPoolExecutor constructor like this –

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

mp_context = multiprocessing.get_context("fork")
with ProcessPoolExecutor(max_workers=1000, mp_context=mp_context) as executor:
    executor.submit(sum, [1, 2])

From the official documentation:

Changed in version 3.8: On macOS, the spawn start method is now the default. The fork start method should be considered unsafe as it can lead to crashes of the subprocess as macOS system libraries may start threads.

Important
Using fork in your process pool will slow down pool creation and execution of tasks in the pool on macOS (which is why the default was changed to spawn). But the same code (with fork) will run much faster on non-macOS UNIX systems.

So, if you are writing code on macOS that is expected to run on a non-macOS UNIX system in production, use fork to create the child processes.
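One way to act on this advice is to select the start method by platform at runtime. pick_start_method below is a hypothetical helper, not part of any library; it simply prefers fork on Linux and falls back to spawn elsewhere:

```python
import multiprocessing
import sys
from concurrent.futures import ProcessPoolExecutor


def pick_start_method():
    # Hypothetical policy: fork is fast and safe on Linux;
    # spawn is the safer default on macOS and Windows.
    if sys.platform.startswith("linux"):
        return "fork"
    return "spawn"


if __name__ == "__main__":
    ctx = multiprocessing.get_context(pick_start_method())
    with ProcessPoolExecutor(mp_context=ctx) as executor:
        result = executor.submit(sum, [1, 2]).result()
        assert result == 3
```

This keeps development on macOS and production on Linux running the same code, each with a start method appropriate to the platform.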

Answered By: Prashant Mishra