Python ThreadPoolExecutor (concurrent.futures) memory leak
Question:
Hello, I’m trying to load a big list (list.txt) and send each line to a function, Do_something(), using concurrent.futures.ThreadPoolExecutor.
The problem is that whatever I do, memory usage keeps growing. At first I thought the reason was that I read list.txt into a variable with list = open("list.txt").readlines(), so I changed the code to for i in open("list.txt").readlines(), but the problem remains. Is it possible to free memory line by line after each job finishes?
My Code:
import time
from concurrent.futures import ThreadPoolExecutor

def Do_something(i):
    time.sleep(5)  # Do something ~ takes a few seconds
    pass

if __name__ == "__main__":
    #list = open("list.txt").readlines()
    # even with 1 thread the code has the problem
    with ThreadPoolExecutor(1) as executor:
        try:
            # list.txt == 10,000,000 lines
            [executor.submit(Do_something, i) for i in open("list.txt").readlines()]
        except Exception as exx:
            pass
Answers:
First off, remove the .readlines() call entirely; file objects are already iterables of their lines, so all you’re doing is forcing it to make a list containing all the lines, then another list of all the tasks dispatched using those lines. As a rule, .readlines() is never necessary (it’s a microoptimization on just list(fileobj), and when you don’t need a list, you don’t want to use it).
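To make the difference concrete, here is a minimal sketch that contrasts iterating the file object with calling .readlines(). The helper name count_lines_lazily and the temporary demo file are made up for illustration; they are not part of the question's code.

```python
import os
import tempfile


def count_lines_lazily(path):
    """Count lines by iterating the file object directly.

    Only the current line needs to be kept alive at any moment.
    """
    count = 0
    with open(path) as f:
        for _line in f:  # the file object yields one line per iteration
            count += 1
    return count


# Hypothetical three-line demo file standing in for list.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
    demo_path = tmp.name

lazy_count = count_lines_lazily(demo_path)

with open(demo_path) as f:
    eager_lines = f.readlines()  # builds a list holding every line at once

os.remove(demo_path)
```

Both approaches see the same lines; the only difference is that .readlines() keeps all of them in a list, which is what hurts with 10,000,000 lines.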
Secondly, you’re explicitly trying to make tasks for all of the input lines up front before getting results from any of the tasks. While avoiding .readlines() saves the overhead of the list wrapping all those lines, you’re still trying to hold them all in memory, one to each task. If you lack the RAM to hold all the tasks at once, you can’t do this.
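One way to avoid holding every task at once is to keep only a small window of futures alive. The following is a rough sketch using concurrent.futures.wait with FIRST_COMPLETED; the function name run_bounded and its max_pending parameter are assumptions made for this example, and it presumes max_pending is at least 1.

```python
import itertools
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(executor, fn, items, max_pending):
    """Run fn(item) for every item, keeping at most max_pending futures alive.

    Returns the largest number of futures that existed at the same time.
    """
    items = iter(items)  # consume the input lazily, one item per submit
    pending = set()
    peak = 0

    # Fill the initial window.
    for item in itertools.islice(items, max_pending):
        pending.add(executor.submit(fn, item))

    while pending:
        peak = max(peak, len(pending))
        # Block until at least one task finishes.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            future.result()  # re-raises any exception raised by fn
        # Replace each finished task with a new one, if input remains.
        for item in itertools.islice(items, len(done)):
            pending.add(executor.submit(fn, item))

    return peak
```

Because finished futures are dropped from the set as soon as they are seen, the memory used for tasks depends on max_pending rather than on the length of the input.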
If you want to queue a certain number of tasks, processing results as they complete and queuing new tasks, you can do something like this (adapted from a patch to make Executor.map avoid the problem you’re experiencing):
import collections
import itertools
import time

def executor_map(executor, fn, *iterables, timeout=None, chunksize=1, prefetch=None):
    """Returns an iterator equivalent to map(fn, iter).

    Args:
        executor: An Executor to submit the tasks to
        fn: A callable that will take as many arguments as there are
            passed iterables.
        timeout: The maximum number of seconds to wait. If None, then there
            is no limit on the wait time.
        chunksize: The size of the chunks the iterable will be broken into
            before being passed to a child process. This argument is only
            used by ProcessPoolExecutor; it is ignored by
            ThreadPoolExecutor.
        prefetch: The number of chunks to queue beyond the number of
            workers on the executor. If None, a reasonable default is used.

    Returns:
        An iterator equivalent to: map(fn, *iterables) but the calls may
        be evaluated out-of-order.

    Raises:
        TimeoutError: If the entire result iterator could not be generated
            before the given timeout.
        Exception: If fn(*args) raises for any values.
    """
    if timeout is not None:
        end_time = timeout + time.monotonic()
    if prefetch is None:
        prefetch = executor._max_workers
    if prefetch < 0:
        raise ValueError("prefetch count may not be negative")

    argsiter = zip(*iterables)
    initialargs = itertools.islice(argsiter, executor._max_workers + prefetch)
    fs = collections.deque(executor.submit(fn, *args) for args in initialargs)

    # Yield must be hidden in closure so that the futures are submitted
    # before the first iterator value is required.
    def result_iterator():
        nonlocal argsiter
        try:
            while fs:
                if timeout is None:
                    res = fs.popleft().result()
                else:
                    res = fs.popleft().result(end_time - time.monotonic())

                # Dispatch next task before yielding to keep
                # pipeline full
                if argsiter:
                    try:
                        args = next(argsiter)
                    except StopIteration:
                        argsiter = None
                    else:
                        fs.append(executor.submit(fn, *args))

                yield res
        finally:
            for future in fs:
                future.cancel()
    return result_iterator()
Once you’ve got that map utility, you can change your code to:
if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        try:
            # list.txt == 10,000,000 lines
            with open("list.txt") as f:  # Use with statements to get deterministic file close
                for res in executor_map(executor, Do_something, f):
                    pass  # If Do_something returns useful values, you can use them here
                          # with each result going into res
        except Exception as exx:
            pass
which will only have a limited number of tasks in existence at one time (more than the number of workers, but some may already have results you haven’t pulled), with the file being read lazily so it doesn’t blow your RAM.
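If the results are not needed at all, as with the Do_something in the question, a different technique with the same goal is to throttle submission with a semaphore. This is only a sketch under that assumption: submit_throttled and max_in_flight are names invented here, and exceptions raised inside fn are silently dropped because no future is kept.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def submit_throttled(executor, fn, items, max_in_flight):
    """Submit fn(item) for each item, blocking while max_in_flight are outstanding.

    Futures are not retained, so each one can be freed once its task is done.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    for item in items:  # items is consumed lazily, one per free slot
        slots.acquire()  # wait here until an earlier task has finished
        try:
            future = executor.submit(fn, item)
        except BaseException:
            slots.release()  # give the slot back if submit itself failed
            raise
        # Free the slot when the task completes, whether it succeeded or raised.
        future.add_done_callback(lambda _f: slots.release())
```

Called inside a with ThreadPoolExecutor() block, this reads the input only as fast as the workers finish; leaving the block waits for the remaining tasks.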