python – accessing manager object in shared state multiprocessing

Question:

I have a program that populates a shared data structure between processes. This is a customised implementation of a HashMap with separate chaining functionality for items with the same key (hash). The class is defined as follows:

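from multiprocessing import Manager, Process
from multiprocessing.managers import DictProxy, SyncManager
from typing import Any, Dict, Optional, Union


class HashMapChain: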
    """A HashMap with Separate Chaining for key collisions.

    Attributes:
        map: A ``key-value dict`` where ``value`` is a ``list`` object.
        num_items: An ``int`` count of the total number of items stored.
    """

    def __init__(self, manager: Optional[SyncManager] = None) -> None:
        """Initialize the map instance to support being shared between processes.

        Args:
            manager (Optional[SyncManager], optional): If provided, ``self.map`` will be a :class:`DictProxy`, shared among processes. Defaults to ``None``.
        """
        if manager:
            self.map: Union[Dict[Any, Any], DictProxy[Any, Any]] = manager.dict()
        else:
            self.map = dict()
        self.num_items: int = 0

    def insert(self, key, value, manager: Optional[SyncManager] = None):
        """Insert ``value`` into the ``HashMap``.

        Args:
            key (Any): Unique lookup key in the map.
            value (Any): The value to store in the map.
            manager (Optional[SyncManager], optional): If provided, the ``list`` will be a :class:`ListProxy`. Defaults to None.
        """
        if key not in self.map: # New List for new hashed item
            if manager:
                self.map[key] = manager.list()
            else:
                self.map[key] = list()

        vals = self.map[key]
        if value not in vals:
            vals.append(value)
            self.num_items += 1

With the data structure above, my intent is that in a non-multiprocessing environment I have a HashMap[Dict, List[Any]], and in a multiprocessing environment I have a HashMap[DictProxy, ListProxy]. The desired data layout is of the form:

hashmap["k1"] -> ["some", "values", "mapped", "to", "the same key1"]
hashmap["k2"] -> ["other", "vals", "mapped", "to", "the same key2"] 

Here is the rest of the code using this data structure.

def callback(hashmap: HashMapChain, manager: SyncManager):
    key, value = getItemFromDb()
    hashmap.insert(key=key, value=value, manager=manager)

def main():
    with Manager() as manager:
        hashmap = HashMapChain(manager=manager)
        processes = []
        for _ in range(5):
            process = Process(target=callback, args=(hashmap, manager))
            process.start() # <-- Exception occurs here.
            processes.append(process)
        for process in processes:
            process.join()
            

if __name__ == '__main__':
    main()

My issue is this: since I need access to the manager to create a new DictProxy or ListProxy in the HashMapChain.insert() method, how can I pass it to callback()?

When I run this piece of code, I get a TypeError: cannot pickle 'weakref' object. This happens because I am passing the manager reference to the subprocesses.

Note: What I found interesting is that this error only occurs when I run my code on macOS. When I run it on Linux, it works just fine.

Is there a way I could have approached this design differently? Why does this work fine in Linux?

Asked By: i_use_the_internet

||

Answers:

In case you missed it: self.num_items will not be shared between the HashMapChain instances in the main and child processes, so keep that in mind if it changes anything.
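
One way to sidestep the unshared counter, assuming it is acceptable to compute the total on demand rather than cache it, is to derive it from the map itself. The helper name count_items below is hypothetical and not part of the original code; it is a minimal sketch that works on a plain dict and should work the same way on a DictProxy whose values are ListProxy objects.

```python
from typing import Any, Mapping, Sequence


def count_items(mapping: Mapping[Any, Sequence[Any]]) -> int:
    """Return the total number of values stored across all chains.

    ``mapping`` is assumed to behave like ``HashMapChain.map``: each key
    maps to a list (or list proxy) of the values chained under that key.
    """
    return sum(len(chain) for chain in mapping.values())


if __name__ == "__main__":
    layout = {
        "k1": ["some", "values"],
        "k2": ["other"],
    }
    print(count_items(layout))
```

With a DictProxy, every len() call is a round trip to the manager process, and the total is only a snapshot if other processes are still inserting.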

As you pointed out, the error happens because manager objects cannot be pickled. The code works fine on Linux because most UNIX systems other than macOS have traditionally used fork as the default start method, and fork does not need to pickle data to send it to the child process. macOS has defaulted to spawn since Python 3.8, and spawn pickles every argument passed to Process. So a quick fix could be to change your start method from spawn to fork on macOS. But if that’s out of the question, you need a solution that doesn’t involve sharing a manager.
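
The pickling problem can be reproduced without starting any process. The sketch below uses two hypothetical helpers, can_pickle and fork_context_or_none, that are not part of the original code. It pickles an unstarted SyncManager with the standard pickle module, so the exact error message depends on which attribute the pickler reaches first and may differ from the one in the question.

```python
import multiprocessing as mp
import pickle
from multiprocessing.managers import SyncManager


def can_pickle(obj) -> bool:
    """Return True if ``obj`` survives ``pickle.dumps``, else False."""
    try:
        pickle.dumps(obj)
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
    return True


def fork_context_or_none():
    """Return a fork-based context where the platform offers one, else None.

    Windows has no fork, and on macOS fork is available but is documented
    as unsafe, which is why spawn is the default there.
    """
    if "fork" in mp.get_all_start_methods():
        return mp.get_context("fork")
    return None


if __name__ == "__main__":
    print("default start method:", mp.get_start_method())
    print("manager picklable:", can_pickle(SyncManager()))
    print("plain data picklable:", can_pickle({"k1": ["plain", "data"]}))
```

If the quick fix is acceptable, a context returned by fork_context_or_none() can be used in place of the multiprocessing module itself, for example ctx.Manager() and ctx.Process(...), which keeps the choice of start method local instead of changing it globally.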

I previously wrote an answer detailing such a solution: instead of manually nesting a manager.list, you let the outer manager.dict (your self.map variable) handle such datatypes automatically by creating and sharing the appropriate proxies for them, so you don’t have to create and share managers yourself. You could adapt it to your needs here, with the added benefit that insert() no longer needs to check whether a manager was supplied. An example is below:

from multiprocessing.managers import SyncManager, MakeProxyType, ListProxy, State
import multiprocessing
from multiprocessing import Process
from collections import UserDict


def init():
    global manager
    global lock

    manager = SyncManager()
    lock = multiprocessing.Lock()


class ManagerDict(UserDict):

    def __check_state(self):
        global manager
        global lock

        # Managers are not thread-safe, protect starting one from within another with a lock
        with lock:
            if manager._state.value != State.STARTED:
                manager.start(initializer=init)

    def __setitem__(self, key, value):
        global manager
        self.__check_state()

        if isinstance(value, list):
            value = manager.list(value)
        elif isinstance(value, dict):
            value = manager.dict(value)
        return super().__setitem__(key, value)


ManagerDictProxy = MakeProxyType('DictProxy', (
    '__contains__', '__delitem__', '__getitem__', '__iter__', '__len__',
    '__setitem__', 'clear', 'copy', 'get', 'items',
    'keys', 'pop', 'popitem', 'setdefault', 'update', 'values'
    ))
ManagerDictProxy._method_to_typeid_ = {
    '__iter__': 'Iterator',
    }

SyncManager.register('list', list, ListProxy)
SyncManager.register('dict', ManagerDict, ManagerDictProxy)


class HashMapChain:
    """A HashMap with Separate Chaining for key collisions.

    Attributes:
        map: A ``key-value dict`` where ``value`` is a ``list`` object.
        num_items: An ``int`` count of the total number of items stored.
    """

    def __init__(self, manager=None) -> None:
        if manager:
            self.map = manager.dict()
        else:
            self.map = dict()
        self.num_items: int = 0

    def insert(self, key, value):
        """Insert ``value`` into the ``HashMap``.

        Args:
            key (Any): Unique lookup key in the map.
            value (Any): The value to store in the map.
        """
        if key not in self.map: # New List for new hashed item
            self.map[key] = []

        vals = self.map[key]
        if value not in vals:
            vals.append(value)
            self.num_items += 1


def callback(key, hashmap: HashMapChain):
    value = "the same " + key
    hashmap.insert(key=key, value=value)


def main():
    # Note that we are using a SyncManager, not Manager
    m = SyncManager()
    m.start(initializer=init)

    hashmap = HashMapChain(manager=m)
    processes = []
    for i in range(5):
        process = Process(target=callback, args=(f"k{i}", hashmap))
        process.start()
        processes.append(process)
    for process in processes:
        process.join()

    print(hashmap.map["k0"])


def main2():
    hashmap = HashMapChain()
    for i in range(5):
        callback(f"k{i}", hashmap)
    print(hashmap.map)


if __name__ == '__main__':
    main()  # Multiprocessing approach
    main2()  # Normal approach

Output

['the same k0']
{'k0': ['the same k0'], 'k1': ['the same k1'], 'k2': ['the same k2'], 'k3': ['the same k3'], 'k4': ['the same k4']}

Update

What is the difference between the global values in init vs in ManagerDict?

There is no difference. When you start a manager, Python launches a separate server process, and all shared data is stored in that manager process. The initializer kwarg (in manager.start(..)) names a function to call in that manager process before the manager starts serving requests. So by setting the initializer to our function init, we create two variables and make them global so that ManagerDict (whose methods run inside the manager process) can access them. It follows that if you omit this kwarg when starting the manager, ManagerDict will raise a NameError, since lock and manager haven’t been defined. Similarly, trying to access these global variables in your main process raises the same error (you are not calling init in the main process, only passing it as an argument when starting the manager).
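
To see that the initializer runs only inside the manager process, here is a minimal sketch. The names DemoManager, Probe, greeting and demo are all hypothetical and exist only for this illustration. Probe is registered so that its read() method executes in the manager process and reports what that process sees; the parent then checks its own globals.

```python
import multiprocessing as mp
from multiprocessing.managers import SyncManager


def init():
    # Called once, inside the manager process, before it serves requests.
    global greeting
    greeting = "set inside the manager process"


class Probe:
    """Hypothetical helper whose methods execute in the manager process."""

    def read(self):
        # Looks up the module-level name in whichever process runs this.
        return globals().get("greeting")


class DemoManager(SyncManager):
    """Hypothetical subclass so the registration stays local to this demo."""


DemoManager.register("Probe", Probe)


def demo(start_method=None):
    # start_method=None selects the platform default; "fork" is Unix-only.
    ctx = mp.get_context(start_method)
    m = DemoManager(ctx=ctx)
    m.start(initializer=init)
    try:
        seen_by_manager = m.Probe().read()
    finally:
        m.shutdown()
    # init() never ran in this process, so the global is absent here.
    seen_by_parent = globals().get("greeting")
    return seen_by_manager, seen_by_parent


if __name__ == "__main__":
    print(demo())
```

The sketch uses globals().get(...) rather than the bare name so that the missing variable shows up as None instead of raising the NameError described above.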

If the manager is created in main2, which module registers the ManagerDict and ListProxy, and how will init(), ManagerDict.__check_state and ManagerDict.__setitem__ look?

There seems to be some confusion here. The manager that you create in function main is not the same manager that ManagerDict has access to (note how in main the manager is assigned to variable m). ManagerDict (along with the init function) creates and handles its own manager and is therefore completely self-contained. So if you wanted to create a separate module, say proxies.py, you only need to move the class ManagerDict, the function init, and all the register statements into that module. Then in your main.py (which contains the if __name__... block, class HashMapChain, etc.) you only need the statement import proxies for it all to work as expected (make sure to pass the initializer kwarg though!).

As a side note, it’s probably a good idea to subclass SyncManager in proxies.py and make all changes to that subclass only.
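
A rough sketch of what such a proxies.py could look like follows. The subclass name ChainManager is hypothetical, and the sketch makes two changes of its own to the example above: it reuses the stock DictProxy class instead of building one with MakeProxyType, and it renames the state check to _ensure_started. Treat it as an untested adaptation, not a drop-in replacement.

```python
# proxies.py (module name taken from the discussion above)
import multiprocessing
from collections import UserDict
from multiprocessing.managers import DictProxy, State, SyncManager


class ChainManager(SyncManager):
    """Hypothetical subclass; registering on it leaves SyncManager untouched."""


def init():
    # Runs inside the manager process: set up the nested manager and its lock.
    global manager, lock
    manager = ChainManager()
    lock = multiprocessing.Lock()


class ManagerDict(UserDict):
    """Dict that turns list and dict values into shared proxies on assignment."""

    def _ensure_started(self):
        # Start the nested manager lazily, guarded by the lock from init().
        with lock:
            if manager._state.value != State.STARTED:
                manager.start(initializer=init)

    def __setitem__(self, key, value):
        self._ensure_started()
        if isinstance(value, list):
            value = manager.list(value)
        elif isinstance(value, dict):
            value = manager.dict(value)
        super().__setitem__(key, value)


# Only the subclass gets the customised 'dict'; 'list' is inherited as-is.
ChainManager.register("dict", ManagerDict, DictProxy)
```

In main.py this would be used as m = proxies.ChainManager() followed by m.start(initializer=proxies.init), while a plain SyncManager elsewhere in the program keeps its default behaviour.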

Answered By: Charchit Agarwal