ProcessPoolExecutor.map hangs on large load

Question:

We are experiencing hangs when running ProcessPoolExecutor.map, but only under a relatively large load.

The behaviour we see is that after about one minute of hard work the job seems to hang: CPU utilization drops sharply and then goes idle, and the stack traces we capture keep showing the same calls as time progresses.

import concurrent.futures
import multiprocessing as mp

def work_wrapper(args):
    return work(*args)

def work(*args):
    ...  # the actual work

def start_working(...):
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_threads, mp_context=mp.get_context('fork')) as executor:
        args = [arg_list1, arg_list2, ...]
        for res in executor.map(work_wrapper, args):
            pass

if __name__ == "__main__":
    mp.set_start_method('fork', force=True)
    start_working(...)

Stack trace (we capture one every 5 minutes, but they all look pretty similar):

Thread 0x00007f4d0ca27700 (most recent call first):
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373 in _send
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 402 in _send_bytes
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205 in send_bytes
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 250 in _feed
File "/usr/local/lib/python3.10/threading.py", line 953 in run
File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f4d156fc700 (most recent call first):
File "/usr/local/lib/python3.10/threading.py", line 1116 in _wait_for_tstate_lock
File "/usr/local/lib/python3.10/threading.py", line 1096 in join
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 199 in _finalize_join
File "/usr/local/lib/python3.10/multiprocessing/util.py", line 224 in __call__
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 151 in join_thread
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 515 in join_executor_internals
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 469 in terminate_broken
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 323 in run
File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap
Thread 0x00007f4d19cce740 (most recent call first):
File "/usr/local/lib/python3.10/threading.py", line 1116 in _wait_for_tstate_lock
File "/usr/local/lib/python3.10/threading.py", line 1096 in join
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 775 in shutdown
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 649 in __exit__
File "/app/main.py", line 256 in start_working
File "/app/main.py", line 51 in main
File "/app/main.py", line 96 in <module>
File "/app/main.py", line 96 in <module>
File "/app/main.py", line 51 in main
File "/app/main.py", line 256 in start_working
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 649 in __exit__
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 775 in shutdown
File "/usr/local/lib/python3.10/threading.py", line 1096 in join
File "/usr/local/lib/python3.10/threading.py", line 1116 in _wait_for_tstate_lock
Thread 0x00007f4d19cce740 (most recent call first):
File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap
File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 323 in run
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 469 in terminate_broken
File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 515 in join_executor_internals
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 151 in join_thread
File "/usr/local/lib/python3.10/multiprocessing/util.py", line 224 in __call__
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 199 in _finalize_join
File "/usr/local/lib/python3.10/threading.py", line 1096 in join
File "/usr/local/lib/python3.10/threading.py", line 1116 in _wait_for_tstate_lock
Thread 0x00007f4d156fc700 (most recent call first):
File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap
File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/local/lib/python3.10/threading.py", line 953 in run
File "/usr/local/lib/python3.10/multiprocessing/queues.py", line 250 in _feed
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 205 in send_bytes
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 402 in _send_bytes
File "/usr/local/lib/python3.10/multiprocessing/connection.py", line 373 in _send
Thread 0x00007f4d0ca27700 (most recent call first):

Python version: 3.10.8, Docker base image: python:3.10-slim

I tried updating the Python version and changing the multiprocessing context (tried both spawn and fork; both give the same behaviour).

Asked By: NedStark


Answers:

The problem you’re having is due to Executor.map not handling large/infinite iterable inputs in a sane way. Before it yields a single value, it consumes the entire input iterator and submits a task for every input.

If your inputs are produced lazily (on the theory that this would keep memory usage down), nope, they’re all read in immediately. If they’re infinite (with the assumption you can break and stop when receiving a specific result), nope, the program will try to submit infinite tasks and you’ll run out of memory. If they’re just huge, well, you’ll pay the overhead for all the submitted tasks (management overhead, pickling them to pass them to the workers, etc.).
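To see the eager consumption for yourself, here's a tiny, self-contained demo (illustrative only; generate_args and slow_work are throwaway names, not from your code): every "producing" line prints before the first result comes back, because map submits a task for every input before yielding anything.

import concurrent.futures
import time

def generate_args():
    for i in range(10):
        print(f"producing arg {i}")  # all ten of these print up front
        yield i

def slow_work(x):
    time.sleep(0.1)
    return x * 2

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        # executor.map() exhausts generate_args() and submits all ten tasks
        # before the first result is yielded, so every "producing arg" line
        # appears before the first "got" line.
        for res in executor.map(slow_work, generate_args()):
            print("got", res)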

If you can avoid processing in blocks quite that large, that's an easy fix. You could also copy the fixed implementation of Executor.map from the PR attached to the issue I linked and use it as a top-level function, manually passing an executor to it to act as the self argument (it's implemented entirely in terms of submit calls on the underlying executor, so it doesn't need to be an instance method). The fixed version, by default, pulls and submits twice as many tasks as the pool has workers, then only pulls and submits additional tasks as the original tasks complete and the caller requests the results. So if you're looping over the results live, and not storing them, the additional memory cost is proportional to the number of workers (typically small and fixed), not to the total number of inputs (which can be huge).

Here’s an adapted version (untested, please comment if I typoed somewhere):

import collections
import itertools
import time


def executor_map(executor, fn, *iterables, timeout=None, chunksize=1, prefetch=None):
    """Returns an iterator equivalent to map(fn, iter).
    Args:
        executor: An Executor to submit the tasks to
        fn: A callable that will take as many arguments as there are
            passed iterables.
        timeout: The maximum number of seconds to wait. If None, then there
            is no limit on the wait time.
        chunksize: Accepted for signature compatibility with Executor.map,
            but note that this adapted version does not actually split the
            inputs into chunks; the argument is currently unused.
        prefetch: The number of chunks to queue beyond the number of
            workers on the executor. If None, a reasonable default is used.
    Returns:
        An iterator equivalent to: map(func, *iterables) but the calls may
        be evaluated out-of-order.
    Raises:
        TimeoutError: If the entire result iterator could not be generated
            before the given timeout.
        Exception: If fn(*args) raises for any values.
    """
    if timeout is not None:
        end_time = timeout + time.monotonic()
    if prefetch is None:
        prefetch = executor._max_workers
    if prefetch < 0:
        raise ValueError("prefetch count may not be negative")

    argsiter = zip(*iterables)
    initialargs = itertools.islice(argsiter, executor._max_workers + prefetch)

    fs = collections.deque(executor.submit(fn, *args) for args in initialargs)

    # Yield must be hidden in closure so that the futures are submitted
    # before the first iterator value is required.
    def result_iterator():
        nonlocal argsiter
        try:
            while fs:
                if timeout is None:
                    res = fs.popleft().result()
                else:
                    res = fs.popleft().result(end_time - time.monotonic())

                # Dispatch next task before yielding to keep
                # pipeline full
                if argsiter:
                    try:
                        args = next(argsiter)
                    except StopIteration:
                        argsiter = None
                    else:
                        fs.append(executor.submit(fn, *args))

                yield res
        finally:
            for future in fs:
                future.cancel()
    return result_iterator()

which you could then use to change your code to:

def start_working(...):
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_threads, mp_context=mp.get_context('fork')) as executor:
        args = [arg_list1, arg_list2, ...]
        for res in executor_map(executor, work_wrapper, args):
            pass
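
If you want even fewer tasks in flight, you can pass prefetch explicitly (untested, and assuming the executor_map above); prefetch=0 keeps roughly max_workers tasks submitted at any one time:

# Only max_workers tasks are submitted up front; one more is submitted
# each time a result is consumed.
for res in executor_map(executor, work_wrapper, args, prefetch=0):
    pass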
Answered By: ShadowRanger