psycopg2 leaking memory after large query

Question:

I’m running a large query in a Python script against my PostgreSQL database using psycopg2 (I upgraded to version 2.5). After the query finishes, I close the cursor and the connection, and even run gc, but the process still consumes a ton of memory (7.3 GB, to be exact). Am I missing a cleanup step?

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()

# Attempted cleanup: drop the result set, close everything, force a collection.
del rows
cursor.close()
conn.close()

import gc
gc.collect()

Answers:

Please see the next answer by @joeblog for the better solution.


First, you shouldn’t need all that RAM at all. What you should be doing here is fetching the result set in chunks. Don’t do a fetchall(). Instead, use the much more efficient cursor.fetchmany() method; see the psycopg2 documentation.
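A minimal sketch of that chunked-fetch pattern (the connection string and query are the placeholders from the question; note joeblog’s caveat below that with the default client-side cursor the driver still buffers the full result on execute(), so pair this with a named cursor if memory is the concern):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")

total = 0
while True:
    rows = cursor.fetchmany(2000)   # at most 2000 rows per call
    if not rows:
        break
    total += len(rows)              # replace with your per-row processing

cursor.close()
conn.close()
print(total, "rows processed")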

Now, the explanation for why it isn’t freed, and why that isn’t a memory leak in the formally correct use of that term.

Most processes don’t release memory back to the OS when it’s freed; they just make it available for re-use elsewhere within the program.

Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.

What usually ends up happening, unless the program takes extra care, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.

The OS can’t tell whether the program considers this memory still in use or not, so it can’t just claim it back. Since the program doesn’t tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.

It’s possible to write programs that hand memory back to the OS, but I’m not sure that you can do it with Python.
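If you want to observe this in your own process, here is a rough sketch (Linux-only, since it reads /proc/self/status; the list of tuples is just a stand-in for a big fetchall() result, and the exact numbers depend on the allocator and on how fragmented the heap is):

import gc

def rss_kb():
    # Current resident set size in kB, read from /proc (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

print("baseline:", rss_kb(), "kB")
rows = [(str(i), i * 2) for i in range(1_000_000)]  # stand-in for a big result set
print("after allocating:", rss_kb(), "kB")
del rows
gc.collect()
# How far RSS falls here varies: a clean, isolated allocation like this one
# is often handed back, but a long-lived process with interleaved allocations
# usually leaves the heap fragmented and keeps its RSS high.
print("after del + gc.collect():", rss_kb(), "kB")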


So: this isn’t actually a memory leak. If you do something else that uses lots of memory, the process shouldn’t grow much, if at all; it will re-use the previously freed memory from the last big allocation.

Answered By: Craig Ringer

I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.

Instead of

cursor = conn.cursor()

write

cursor = conn.cursor(name="my_cursor_name")

or simpler yet

cursor = conn.cursor("my_cursor_name")

The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors

I found the instructions a little confusing in that I thought I’d need to rewrite my SQL to include “DECLARE my_cursor_name ….” and then a “FETCH 2000 FROM my_cursor_name”, but it turns out psycopg does all of that for you under the hood if you simply override the “name=None” default parameter when creating a cursor.

The suggestion above of using fetchone or fetchmany doesn’t resolve the problem since, if you leave the name parameter unset, psycopg will by default attempt to load the entire query result into RAM. The only other thing you may need to do (besides declaring a name parameter) is change the cursor.itersize attribute from the default 2000 to, say, 1000 if you still have too little memory.
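Put together, a minimal sketch of this (the cursor name, connection string and query are placeholders):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")

# Passing a name makes this a server-side cursor: psycopg2 issues
# DECLARE / FETCH behind the scenes instead of pulling all rows at once.
cursor = conn.cursor("my_cursor_name")
cursor.itersize = 1000          # rows fetched per round trip to the server

cursor.execute("""large query""")
for row in cursor:              # iterating pulls itersize rows at a time
    pass                        # replace with your per-row processing

cursor.close()
conn.close()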

Answered By: joeblog

Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.

import psycopg2
import sys

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')     # named, therefore server-side, cursor
curPG.itersize = 100000                # rows fetched at one time from the server

curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS for a named cursor!
cptLigne = 0                           # row counter
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()             # to see the progression
conPG.commit()                         # also closes the server-side cursor
conPG.close()

As you will see, the dots appear in rapid bursts and then pause while the next buffer of rows (itersize) is fetched, so you don’t need to use fetchmany for performance. When I ran this with /usr/bin/time -v, I got the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn’t need more RAM either, as it uses a temporary table.

Answered By: Le Droid