Using pandas in Django: how to release memory?

Question:

In a Django project that focuses on data analysis, I use pandas to transform the data into a required format. The process is nice and easy, but the project quickly starts using 1 GB of RAM.

I know that Python doesn’t really free up memory in this case (https://stackoverflow.com/a/39377643/2115409) and that pandas might have an issue (https://github.com/pandas-dev/pandas/issues/2659).

How do I use pandas in a django project without exploding memory?

Asked By: Private


Answers:

There are several things you can do to reduce memory consumption in your Django project. The link you shared on releasing memory highlights a few of them. However, since you’re also dealing with a web framework, there are some additional points to consider.

  • Read data in smaller chunks: When you read data, it is loaded into memory all at once for processing. Avoid loading everything in one go; read it in chunks instead (a combined read/write sketch appears at the end of this answer).

  • Write data in smaller chunks: Similarly, write data in chunks rather than all at once. This reduces peak memory consumption while writing large dataframes to disk (covered in the same sketch below).

  • Ensure correct data types: Don’t rely on the default data types that pandas assigns to your data. For example, your data might easily fit into int32 or float32 while pandas assigns it int64/float64. Be explicit where possible; remember the Zen of Python: explicit is better than implicit (sketched at the end of this answer).

  • Look for code-level optimizations: Specifically, avoid making many copies of your dataframe. Prefer in-place transformations over re-creating a dataframe for each operation (sketched at the end of this answer).

  • Use generators instead of lists: A generator does not load all the data into memory; it yields items only as they are needed (sketched at the end of this answer).

  • Avoid manipulation in the request/response cycle: Avoid doing pandas manipulation inside the request/response cycle of Django views; every request that does so holds a full dataframe in the web process. Move that work to a background task instead (next point).

  • Use queues: Shift the data manipulation to a task queue such as Celery (sketched at the end of this answer). You can also run the workers on managed cloud services, or just on another EC2 instance, for example. Distributed queues let you scale the system further and keep memory consumption out of the web process.

  • Querysets: Use Django querysets efficiently. They are evaluated lazily; filter them, or read them in chunks, so you only pull the rows you need (sketched at the end of this answer).

  • Dataframes up for garbage collection: Use del when you’re done with a dataframe. This removes the reference to it; you can then call gc.collect() to trigger the garbage collector explicitly and free up the memory.

You may also use the gc.set_threshold() function to further tune the garbage collector.
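
For instance, lowering the thresholds makes the collector run more often; the numbers below are illustrative only, not a recommendation:

import gc

# Defaults are (700, 10, 10); lower thresholds make the collector run
# more often, trading some CPU time for a smaller resident footprint.
gc.set_threshold(400, 5, 5)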

A small example of the del / gc.collect() pattern might look like this:

import gc
import pandas as pd

def service_layer_function(csv_path):
    # Generate the dataframe (csv_path is a placeholder for your source).
    df = pd.read_csv(csv_path)
    # ... do the required transformations, keeping only the small result.
    result = df["value"].sum()  # "value" is a placeholder column name
    # Explicitly release the reference to the dataframe ...
    del df
    # ... then trigger the garbage collector to free the memory right away.
    gc.collect()
    return result
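
To make the chunked reading and writing points concrete, here is a minimal sketch; the file names and the price/quantity columns are placeholders for your own data:

import pandas as pd

# chunksize makes read_csv return an iterator of dataframes instead of
# loading the whole file at once.
reader = pd.read_csv("input.csv", chunksize=50_000)

for i, chunk in enumerate(reader):
    transformed = chunk.assign(total=chunk["price"] * chunk["quantity"])
    # Append each processed chunk to disk; write the header only once.
    transformed.to_csv(
        "output.csv",
        mode="w" if i == 0 else "a",
        header=(i == 0),
        index=False,
    )

Only one chunk is resident at any time, so peak memory is bounded by the chunk size rather than the file size.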
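
For the data-types point, a sketch; the column names are assumed, so adjust them to your schema:

import pandas as pd

# Declare narrow types up front instead of accepting int64/float64/object.
df = pd.read_csv(
    "input.csv",
    dtype={"user_id": "int32", "score": "float32", "country": "category"},
)

# Or downcast an existing column and inspect the footprint:
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
print(df.memory_usage(deep=True))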
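
For the copies point, one caveat: inplace=True does not always avoid an internal copy in pandas, but avoiding long chains of named intermediates does help, because each named intermediate keeps a full dataframe alive. A sketch:

import pandas as pd

df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

# Instead of step1 = df.dropna(); step2 = step1.rename(...); ...,
# rebind the same name so each previous version can be collected:
df = df.dropna()
df = df.rename(columns={"a": "x"})

# Dropping in place avoids yet another full reassignment:
df.drop(columns=["b"], inplace=True)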
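
For the generator point, a sketch that streams records instead of building a list; the file name and the amount column are placeholders:

import pandas as pd

def iter_records(csv_path, chunksize=10_000):
    # Yield one record at a time; only the current chunk is in memory.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        for record in chunk.to_dict("records"):
            yield record

# The whole file is never materialised as one list:
total = sum(record["amount"] for record in iter_records("input.csv"))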
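
For the queue point, a sketch of moving the work into a Celery task, assuming Celery is already wired up in your Django project; the task name and paths are hypothetical:

import pandas as pd
from celery import shared_task

@shared_task
def build_report(csv_path, out_path):
    # Runs in a worker process, so the Django web process never
    # holds this dataframe in memory.
    df = pd.read_csv(csv_path)
    df.describe().to_csv(out_path)

# In a view, enqueue instead of computing inline:
# build_report.delay("/data/input.csv", "/data/summary.csv")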
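
For the queryset point, a sketch; the Measurement model and its fields are hypothetical:

import pandas as pd
from myapp.models import Measurement  # hypothetical app/model

def measurements_frame(batch_size=2_000):
    # .values() skips model instantiation; .iterator() streams rows from
    # the database instead of caching the whole result on the queryset.
    rows = Measurement.objects.values("id", "value", "taken_at").iterator(
        chunk_size=batch_size
    )
    # The dataframe itself still materialises, but Django no longer
    # caches a second full copy of the result set.
    return pd.DataFrame(list(rows))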
Answered By: Sanyam Khurana