Python ThreadPoolExecutor (concurrent.futures) memory leak

Question

Hello, I’m trying to load a big list (list.txt) and send each line to a function, Do_something(), using concurrent.futures.ThreadPoolExecutor.
The problem is that whatever I do, memory usage keeps growing. At first I thought the cause was that I read list.txt into a variable (list = open("list.txt").readlines()), so I changed the code to for i in open("list.txt").readlines(), but the problem persists. Is it possible to free memory line by line as each job finishes?

My Code:

import time
from concurrent.futures import ThreadPoolExecutor



def Do_something(i):
    time.sleep(5)  # Do something that takes a few seconds


if __name__ == "__main__":
    # list = open("list.txt").readlines()
    # Even with 1 thread the code has the problem
    with ThreadPoolExecutor(1) as executor:
        try:
            # list.txt contains 10,000,000 lines
            [executor.submit(Do_something, i) for i in open("list.txt").readlines()]
            
        except Exception as exx:
            pass

Asked By: Mehdi SH


Answer 1

First off, remove the .readlines() call entirely; file objects are already iterables of their lines, so all .readlines() does is force Python to build a list containing every line, after which you build a second list of all the tasks dispatched for those lines. As a rule, .readlines() is never necessary (it’s a micro-optimization over plain list(fileobj), and when you don’t need a list, you don’t want either).
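As a minimal sketch of iterating the file object directly, the helper below processes one line at a time without ever building a list. The name count_nonblank_lines is hypothetical, and io.StringIO stands in for open("list.txt") only so the snippet runs without a real file.

```python
import io


def count_nonblank_lines(fileobj):
    """Count non-blank lines while holding only one line in memory at a time."""
    count = 0
    for line in fileobj:  # iterate the file object itself; no .readlines()
        if line.strip():
            count += 1
    return count


# io.StringIO is a stand-in for open("list.txt") (an assumption for the demo)
sample = io.StringIO("alpha\nbeta\n\ngamma\n")
print(count_nonblank_lines(sample))  # prints 3
```

The same loop works unchanged on a real file opened with a with statement.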

Secondly, you’re explicitly trying to make tasks for all of the input lines up front before getting results from any of the tasks. While avoiding .readlines() saves the overhead of the list wrapping all those lines, you’re still trying to hold them all in memory, one to each task. If you lack the RAM to hold all the tasks at once, you can’t do this.
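To make that point concrete, here is a small sketch that counts how many input lines get consumed before any result is retrieved when tasks are submitted in a list comprehension. The names submit_everything and fake_lines are hypothetical, and len stands in for the real work.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_everything(n_lines):
    """Submit one task per line up front and report what is held in memory."""
    consumed = 0

    def fake_lines():
        # Stand-in for a lazily read file; counts every line handed out
        nonlocal consumed
        for i in range(n_lines):
            consumed += 1
            yield f"line {i}\n"

    with ThreadPoolExecutor(1) as executor:
        # The comprehension drains the whole input before any result is used
        futures = [executor.submit(len, line) for line in fake_lines()]
        held = len(futures)
    return consumed, held


print(submit_everything(1000))  # prints (1000, 1000)
```

Even though the input is produced lazily, every line and every Future exists at the same time, which is what exhausts memory at 10,000,000 lines.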

If you want to queue a certain number of tasks, processing results as they complete and queuing new tasks, you can do something like this (adapted from a patch to make Executor.map avoid the problem you’re experiencing):

import collections
import itertools
import time


def executor_map(executor, fn, *iterables, timeout=None, chunksize=1, prefetch=None):
    """Returns an iterator equivalent to map(fn, *iterables).
    Args:
        executor: An Executor to submit the tasks to
        fn: A callable that will take as many arguments as there are
            passed iterables.
        timeout: The maximum number of seconds to wait. If None, then there
            is no limit on the wait time.
        chunksize: The size of the chunks the iterable will be broken into
            before being passed to a child process. This argument is only
            used by ProcessPoolExecutor; it is ignored by
            ThreadPoolExecutor.
        prefetch: The number of chunks to queue beyond the number of
            workers on the executor. If None, a reasonable default is used.
    Returns:
        An iterator equivalent to: map(fn, *iterables) but the calls may
        be evaluated out-of-order.
    Raises:
        TimeoutError: If the entire result iterator could not be generated
            before the given timeout.
        Exception: If fn(*args) raises for any values.
    """
    if timeout is not None:
        end_time = timeout + time.monotonic()
    if prefetch is None:
        # _max_workers is a private attribute of the executor, not public API
        prefetch = executor._max_workers
    if prefetch < 0:
        raise ValueError("prefetch count may not be negative")

    argsiter = zip(*iterables)
    initialargs = itertools.islice(argsiter, executor._max_workers + prefetch)

    fs = collections.deque(executor.submit(fn, *args) for args in initialargs)

    # Yield must be hidden in closure so that the futures are submitted
    # before the first iterator value is required.
    def result_iterator():
        nonlocal argsiter
        try:
            while fs:
                if timeout is None:
                    res = fs.popleft().result()
                else:
                    res = fs.popleft().result(end_time - time.monotonic())

                # Dispatch next task before yielding to keep
                # pipeline full
                if argsiter:
                    try:
                        args = next(argsiter)
                    except StopIteration:
                        argsiter = None
                    else:
                        fs.append(executor.submit(fn, *args))

                yield res
        finally:
            for future in fs:
                future.cancel()
    return result_iterator()

Once you’ve got that map utility, you can change your code to:

if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        try:
            #list.txt == 10,000,000 Line
            with open("list.txt") as f:  # Use with statements to get deterministic file close
                for res in executor_map(executor, Do_something, f):
                    pass  # If Do_something returns useful values, you can use them here
                          # with each result going into res

        except Exception as exx:
            pass

This keeps only a limited number of tasks in existence at any one time (up to the number of workers plus prefetch, which is more than the number of workers because some tasks may already have results you haven’t pulled yet), with the file being read lazily so it doesn’t exhaust your RAM.
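If the order of results doesn’t matter, a shorter alternative is possible. The sketch below is not the patch above: it is a different technique that uses concurrent.futures.wait with FIRST_COMPLETED and needs no private attributes. The name bounded_map_unordered and its max_in_flight parameter are hypothetical.

```python
import itertools
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def bounded_map_unordered(executor, fn, iterable, max_in_flight):
    """Yield fn(item) results in completion order, with a cap on live tasks."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    items = iter(iterable)
    # Prime the pipeline with at most max_in_flight tasks
    pending = {executor.submit(fn, item)
               for item in itertools.islice(items, max_in_flight)}
    try:
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            # Replace each finished task with a new one before yielding
            for item in itertools.islice(items, len(done)):
                pending.add(executor.submit(fn, item))
            for future in done:
                yield future.result()
    finally:
        # If the consumer stops early or a task raises, drop queued work
        for future in pending:
            future.cancel()


with ThreadPoolExecutor(4) as executor:
    lines = ["a\n", "bb\n", "ccc\n"]  # stand-in for an open file object
    results = sorted(bounded_map_unordered(executor, len, lines, max_in_flight=2))
print(results)  # prints [2, 3, 4]
```

Passing an open file object as the iterable reads it lazily, so only about max_in_flight lines and futures are alive at once, plus any finished results that have not been yielded yet.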

Answered By: ShadowRanger
