Making threads in a pool notice changes to global variables

Question:

I’m running into an interesting situation with pools that I don’t fully understand. I was aware that if you edit a variable or object from a thread, the change is not applied to the main thread and only exists in the isolated reality of that worker thread. Now, however, I’m noticing that threads in a pool don’t even detect changes to a global variable made by the main thread, whether those changes are made after the pool is started or even before.

import multiprocessing as mp

variable = 0

def double(i):
    return i * variable

def main():
    pool = mp.Pool()
    for result in pool.map(double, [1, 2, 3]):
        print(result)
    variable = 1

main()

This is obviously a simplification for the sake of example. In my case I need threads to see updates to the contents of a list, an object property that is modified by the main loop. The funny thing is that even if I move variable = 1 before pool = mp.Pool() in my test, the threads always see 0 and never notice the variable changing to 1.

What does work when using objects is changing the variable on the object whose function is associated with the thread. The weird thing that happens then is that performance on the main thread drops significantly, as it uses a lot more CPU on each call: it’s as if merely informing the thread pool of changes to a list adds a great amount of effort.

What is the most efficient and cheap way to make a thread pool see changes to a global or object variable modified by the main thread, so each time you run pool.map_async or pool.apply_async threads work with the updated version of that var?

Asked By: MirceaKitsune

Answers:

"The funny thing is that even if I move variable = 1 before pool = mp.Pool() in my test, the threads always see 0 and never notice the variable changing to 1."

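First of all, you need to declare variable as global inside main (with a global variable statement). Otherwise Python treats the assignment as creating a new local variable and never touches the module-level one.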
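To illustrate the scoping point on its own, here is a minimal sketch without any pool involved. The function names set_without_global and set_with_global are made up for this example.

```python
variable = 0

def set_without_global():
    # Assigning inside a function creates a new local name;
    # the module-level variable is left untouched.
    variable = 1

def set_with_global():
    # The global statement makes the assignment target the module-level name.
    global variable
    variable = 1

set_without_global()
after_local = variable   # still 0

set_with_global()
after_global = variable  # now 1

print(after_local, after_global)
```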

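But even if you do this, not much will change. That’s because the multiprocessing package spawns (as the name suggests) processes, not threads. Processes are similar to threads; the main difference is that each process has its own isolated memory, so a process never sees another process’s memory. At best, a worker starts with a copy of the parent’s state as it was when the worker was created (with the fork start method); with the spawn start method it re-imports the module and sees the initial value. Either way, changes made in the main process after the workers have started never reach them.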
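The isolation works in both directions. The sketch below, with made-up names (counter, bump, run_demo), lets the workers modify a global and then checks the value in the parent. Which numbers the workers report depends on how the tasks are distributed, but the parent's copy stays at 0.

```python
import multiprocessing as mp

counter = 0

def bump(_):
    # Runs in a worker process and modifies that worker's own copy of counter.
    global counter
    counter += 1
    return counter

def run_demo():
    with mp.Pool(processes=2) as pool:
        seen_by_workers = pool.map(bump, range(4))
    # counter here is the parent's copy, which no worker can touch.
    return seen_by_workers, counter

if __name__ == "__main__":
    seen, parent_value = run_demo()
    print("workers saw:", seen)
    print("parent still has:", parent_value)
```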

That is, unless you use tools specifically designed for inter-process communication. Python is kind enough to wrap some of those for you. In particular, you can send data to the workers and get data back through pool.map, simply by passing a list of arguments and then retrieving the results.
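One way to apply this to the example from the question is to pass the current value along with every task instead of reading a global. This is only a sketch: multiply and run_demo are illustrative names, and functools.partial is just one of several ways to bind the extra argument.

```python
import multiprocessing as mp
from functools import partial

def multiply(i, factor):
    # The worker gets factor as an argument, so it does not rely on globals.
    return i * factor

def run_demo():
    factor = 0
    with mp.Pool(processes=2) as pool:
        first = pool.map(partial(multiply, factor=factor), [1, 2, 3])
        factor = 2  # changed in the main process
        # The new value is sent along with this second batch of tasks.
        second = pool.map(partial(multiply, factor=factor), [1, 2, 3])
    return first, second

if __name__ == "__main__":
    print(run_demo())  # ([0, 0, 0], [2, 4, 6])
```

The pool is created once and reused. Only the arguments change between calls, and they are what gets serialized and sent to the workers.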

However, this is neither cheap nor efficient, at least compared to simple memory manipulation. Python’s multiprocessing communication is implemented on top of pipes, and this kind of communication requires serializing (pickling) objects on one side and deserializing them on the other. That is heavy, so you should avoid sending and retrieving big objects. It is how it is.
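To get a rough feel for that cost, you can measure how many bytes an object turns into when pickled. This is only an approximation of what multiprocessing sends over the pipe, and payload_size is a helper invented for this sketch.

```python
import pickle

def payload_size(obj):
    # Number of bytes the object occupies once serialized with pickle.
    return len(pickle.dumps(obj))

small = [0] * 10
large = list(range(1_000_000))

# Every task argument and every result pays this cost, plus deserialization
# on the receiving side.
print("small list:", payload_size(small), "bytes")
print("large list:", payload_size(large), "bytes")
```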

An alternative is to use multiprocessing.Value and/or multiprocessing.Array. I’m not exactly sure how these are implemented, probably some combination of shared memory with locks. This might be more efficient than the previous method, but it has its own limitations: for example, they only hold fixed-size C-style data, not arbitrary Python objects.
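A minimal sketch of the shared-value approach, assuming a single integer is enough. The shared object is handed to each worker once through the pool's initializer, and the names _factor, init_worker, scale and run_demo are invented for this example. multiprocessing.Array can be passed the same way when you need a fixed-size sequence.

```python
import multiprocessing as mp

_factor = None  # set in every worker by init_worker

def init_worker(shared_factor):
    # Runs once in each worker process and stores the shared handle.
    global _factor
    _factor = shared_factor

def scale(i):
    # Reads the current value from shared memory on every call.
    return i * _factor.value

def run_demo():
    factor = mp.Value("i", 0)  # "i" is the type code for a C int
    with mp.Pool(processes=2, initializer=init_worker,
                 initargs=(factor,)) as pool:
        before = pool.map(scale, [1, 2, 3])
        factor.value = 2  # written by the main process
        after = pool.map(scale, [1, 2, 3])
    return before, after

if __name__ == "__main__":
    print(run_demo())  # ([0, 0, 0], [2, 4, 6])
```

Here the update itself is not pickled and sent with each task. The workers read the new value straight from shared memory the next time they run.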

Answered By: freakish