I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:
OFF requires that I print out the complete list of vertices before I print out the triangles, which means I have to hold the list of triangles in memory until I write the output to file. Meanwhile, I’m getting memory errors because of the sizes of the lists.
What is the best way to tell Python that I no longer need some of the data, and it can be freed?
Python is garbage-collected, so if you reduce the size of your list, it will reclaim memory. You can also use the “del” statement to get rid of a variable completely:
    biglist = [blah, blah, blah]
    # ...
    del biglist
I have heard of people on Linux and Unix-type systems forking a Python process to do some work, getting the results, and then killing it.
This article has notes on the Python garbage collector, but I think lack of memory control is the downside of managed memory.
You can’t explicitly free memory. What you need to do is to make sure you don’t keep references to objects. They will then be garbage collected, freeing the memory.
In your case, where you need large lists, you typically have to reorganize the code to use generators/iterators instead. That way you don’t need to have the large lists in memory at all.
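As a rough illustration of the generator approach, the sketch below reads triangles one at a time instead of building a list. The input format (three integer indices per line) and the names `iter_triangles` and `count_triangles` are made up for this example; they are not taken from the question.

```python
import io

def iter_triangles(lines):
    """Yield one triangle at a time as a tuple of three vertex indices."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines in this toy format
        yield tuple(int(p) for p in parts)

def count_triangles(lines):
    # Consumes the generator lazily, so no list of triangles is ever built.
    return sum(1 for _ in iter_triangles(lines))

# Stand-in for a large input file (hypothetical format: "i j k" per line).
sample = io.StringIO("0 1 2\n2 3 0\n\n1 2 3\n")
print(count_triangles(sample))
```

Because a file object is itself an iterator over lines, the same function can be fed `open(path)` directly, and memory use stays roughly constant regardless of file size.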
According to the official Python documentation, you can explicitly invoke the garbage collector to release unreferenced memory with
    import gc
    gc.collect()
You should do that after marking what you want to discard using del:
    del my_array
    del my_object
    gc.collect()
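To see what an explicit `gc.collect()` adds on top of plain reference counting, here is a small sketch that assumes CPython. It builds two objects that refer to each other, so they are not freed when their names go away, and uses a weak reference to observe when they actually disappear. The `Node` class and `make_cycle` helper are invented for the example.

```python
import gc
import weakref

class Node:
    """Tiny object that can point at another Node."""
    def __init__(self):
        self.other = None

def make_cycle():
    # a and b refer to each other, so their reference counts never reach zero.
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)  # a weak reference does not keep `a` alive

gc.disable()  # stop automatic collection so the effect is observable
try:
    probe = make_cycle()
    alive_before = probe() is not None  # cycle is unreachable but not yet freed
    gc.collect()                        # the cycle detector reclaims it
    alive_after = probe() is not None
finally:
    gc.enable()

print(alive_before, alive_after)
```

For objects that are not part of a reference cycle, `del` alone is enough in CPython and `gc.collect()` has nothing extra to reclaim.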
Unfortunately (depending on your version and release of Python) some types of objects use “free lists” which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory “earmarked” for only objects of a certain type and thereby unavailable to the “general fund”.
The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it’s done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
In your use case, it seems that the best way for the subprocesses to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean, NOT the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you’re all done with them).
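A minimal sketch of that pattern follows. To stay self-contained it launches the child with the standard `subprocess` module rather than `multiprocessing`; the idea is the same. The child builds a large temporary list, writes only a small summary to an ordinary file, and exits, at which point the operating system reclaims all of its memory. The `CHILD_CODE` string, the JSON summary format, and `run_in_child` are assumptions made for illustration.

```python
import json
import os
import subprocess
import sys
import tempfile

# Code run by the short-lived child process. It writes its result to the
# path passed as its first command-line argument.
CHILD_CODE = """
import json, sys
big = list(range(1_000_000))  # memory-hungry temporary data
summary = {"count": len(big), "total": sum(big)}
with open(sys.argv[1], "w") as out:
    json.dump(summary, out)
"""

def run_in_child():
    # An ordinary file that we create, read, and delete explicitly.
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        subprocess.run([sys.executable, "-c", CHILD_CODE, path], check=True)
        with open(path) as src:
            return json.load(src)
    finally:
        os.remove(path)

result = run_in_child()
print(result["count"])
```

The parent only ever holds the small summary, so its own memory footprint does not grow with the size of the child's working data.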
If you don’t care about vertex reuse, you could have two output files–one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.
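Here is one way the two-file idea might look, as a sketch under a few assumptions: triangles arrive as an iterable of three (x, y, z) points, vertices are not shared between triangles, and the output is plain ASCII OFF. Since the OFF header needs the vertex and face counts, which are only known at the end, this version writes the header to the final file and then copies both temporary files after it instead of appending in place. The function name `write_off` is hypothetical.

```python
import os
import shutil
import tempfile

def write_off(triangles, out_path):
    """Stream triangles (each a tuple of three (x, y, z) points) to an OFF file."""
    n_vertices = n_faces = 0
    with tempfile.TemporaryDirectory() as tmp:
        vpath = os.path.join(tmp, "vertices.txt")
        fpath = os.path.join(tmp, "faces.txt")
        with open(vpath, "w") as vfile, open(fpath, "w") as ffile:
            for tri in triangles:
                for x, y, z in tri:
                    vfile.write(f"{x} {y} {z}\n")
                # No vertex reuse: each face uses the three vertices just written.
                ffile.write(f"3 {n_vertices} {n_vertices + 1} {n_vertices + 2}\n")
                n_vertices += 3
                n_faces += 1
        with open(out_path, "w") as out:
            # Header: vertex count, face count, edge count (commonly left as 0).
            out.write(f"OFF\n{n_vertices} {n_faces} 0\n")
            for part in (vpath, fpath):
                with open(part) as src:
                    shutil.copyfileobj(src, out)
    return n_vertices, n_faces
```

Passing a generator as `triangles` means only one triangle is in memory at a time, and the temporary files are removed when the function returns.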
Others have posted some ways that you might be able to “coax” the Python interpreter into freeing the memory (or otherwise avoid having memory problems). Chances are you should try their ideas out first. However, I feel it important to give you a direct answer to your question.
There isn’t really any way to directly tell Python to free memory. The fact of the matter is that if you want that low a level of control, you’re going to have to write an extension in C or C++.
That said, there are some tools to help with this:
del can be your friend, as it marks objects as deletable when there are no other references to them. Note, though, that the CPython interpreter often keeps this memory for later use, so your operating system might not see the “freed” memory.
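The point about "no other references" is easy to miss, so here is a small sketch (assuming CPython's reference counting) showing that `del` removes a name, not the object. The `Mesh` class is just a placeholder for some large object.

```python
import weakref

class Mesh:
    """Placeholder for some large object."""

mesh = Mesh()
alias = mesh               # a second reference to the same object
probe = weakref.ref(mesh)  # weak reference: does not keep the object alive

del mesh
still_alive = probe() is not None  # `alias` still refers to the object

del alias
freed = probe() is None            # last reference gone, object reclaimed

print(still_alive, freed)
```

In practice the lingering reference is often less obvious than `alias`: an entry in another list, a cached result, or a local variable in a function that has not returned yet.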
Maybe you would not run into any memory problem in the first place by using a more compact structure for your data.
Lists of numbers are much less memory-efficient than the format used by the standard array module or the third-party numpy module. You would save memory by putting your vertices in a NumPy 3xN array and your triangles in an N-element array.
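To get a feel for the difference, the sketch below compares a list of floats with the same values stored in a standard-library `array`. It uses `array` rather than NumPy only to avoid a third-party dependency; NumPy arrays store numbers in a similarly compact way. The exact byte counts depend on the Python build, so none are quoted here.

```python
import sys
from array import array

n = 100_000
as_list = [float(i) for i in range(n)]
as_array = array("d", as_list)  # "d" = C double, stored contiguously

# A list holds pointers to separate float objects, so count both parts.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)

print(list_bytes, array_bytes)
```

On a typical 64-bit CPython build the list version takes several times as much memory as the array, because every float in the list is a full Python object.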
I had a similar problem in reading a graph from a file. The processing included the computation of a 200 000×200 000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations using gc.collect() fixed the memory-related aspect of the problem but it resulted in performance issues: I don’t know why but even though the amount of used memory remained constant, each new call to gc.collect() took some more time than the previous one. So quite quickly the garbage collecting took most of the computation time.
To fix both the memory and performance issues I switched to a multithreading trick I read about somewhere (I’m sorry, I cannot find the related post anymore). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() every once in a while to free memory space. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed without the strange performance issue.
Practically it works like this:
    from dask import delayed  # this module wraps the multithreading

    def f(storage, index, chunk_size):  # the processing function
        # read the chunk of size chunk_size starting at index in the file
        # process it using data in storage if needed
        # append data needed for further computations to storage
        return storage

    partial_result = delayed()  # put into the delayed() the constructor for your data structure
    # I personally use "delayed(nx.Graph())" since I am creating a networkx Graph
    chunk_size = 100  # ideally you want this as big as possible while still enabling the computations to fit in memory
    for index in range(0, len(file), chunk_size):
        # we indicate to dask that we will want to apply f to the parameters partial_result, index, chunk_size
        partial_result = delayed(f)(partial_result, index, chunk_size)
        # no computations are done yet!
        # dask will spawn a thread to run f(partial_result, index, chunk_size) once we call partial_result.compute()
        # passing the previous "partial_result" variable in the parameters ensures a chunk will only be processed after the previous one is done
        # it also allows you to use the results of the processing of the previous chunks in the file if needed

    # this launches all the computations
    result = partial_result.compute()
    # one thread is spawned for each "delayed", one at a time, to compute its result
    # dask then closes the thread, which solves the memory freeing issue
    # the strange performance issue with gc.collect() is also avoided
As other answers already say, Python can hold on to memory instead of releasing it to the OS even when it’s no longer in use by Python code (so gc.collect() doesn’t free anything), especially in a long-running program. Anyway, if you’re on Linux you can try to release memory by directly invoking the libc function malloc_trim (man page).
    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)