Psycopg2 uses up memory on large select query
Question:
I am using psycopg2 to query a PostgreSQL database and trying to process all rows from a table with about 380M rows. There are only 3 columns (id1, id2, count), all of type integer. However, when I run the straightforward select query below, the Python process starts consuming more and more memory, until it gets killed by the OS.
Minimal working example (assuming that mydatabase exists and contains a table called mytable):
import psycopg2
conn = psycopg2.connect("dbname=mydatabase")
cur = conn.cursor()
cur.execute("SELECT * FROM mytable;")
At this point the program starts consuming memory.
I had a look, and the PostgreSQL process is behaving well: it uses a fair bit of CPU, which is fine, and a very limited amount of memory.
I was expecting psycopg2 to return an iterator without trying to buffer all of the results of the select; I could then call cur.fetchone() repeatedly to process all rows.
So, how do I select from a 380M row table without using up available memory?
Answers:
You can use a server-side cursor:
cur = conn.cursor('cursor-name')  # passing a name makes it a server-side cursor
cur.itersize = 10000  # how many rows to buffer on the client at a time
cur.execute("SELECT * FROM mytable;")
for row in cur:
    ...  # process each row; only itersize rows are held in memory at once
Another way to use server-side cursors:
with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            ...  # process row
Psycopg2 will fetch itersize rows to the client at a time. Once the for loop exhausts that batch, it will fetch the next one.
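To illustrate the batching behaviour described above without needing a live database, here is a small, self-contained sketch in plain Python. The helper `batched_fetch` is hypothetical (not part of psycopg2); it only mimics how a server-side cursor pulls rows from the server in itersize-sized chunks while presenting a row-at-a-time iterator to your code:

```python
from itertools import islice

def batched_fetch(rows, itersize):
    """Yield rows one at a time, pulling them from the source in
    itersize-sized batches, the way a server-side cursor does.
    (Hypothetical helper for illustration only.)"""
    it = iter(rows)
    while True:
        # One batch corresponds to one round trip to the server.
        batch = list(islice(it, itersize))
        if not batch:
            return
        # The caller's for loop drains this batch before the next fetch,
        # so at most itersize rows are materialised at any one time.
        yield from batch

# Iterating over 25 "rows" with itersize=10 performs 3 fetches
# (10 + 10 + 5) but the caller just sees a flat stream of rows.
for row in batched_fetch(range(25), itersize=10):
    pass  # process one row here
```

This is why memory stays bounded with a server-side cursor: the client never holds more than one batch, regardless of how large the full result set is.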