python postgres can I fetchall() 1 million rows?

Question:

I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.

I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)

q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
    doSomething(row)

what is the smarter way to do this?

Asked By: Medya Gh

||

Answers:

Consider using server side cursor:

When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.

If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.

Here’s an example:

cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
    cursor.execute("FETCH 1000 FROM super_cursor")
    rows = cursor.fetchall()

    if not rows:
        break

    for row in rows:
        doSomething(row)
Answered By: alecxe

fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:

row = cur.fetchone()
while row:
   # do something with row
   row = cur.fetchone()
Answered By: Burhan Khalid

The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:

row = cursor.fetchone()

However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.

Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:

cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)

while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break

    for row in rows:
        # do something with row
        pass

you find more about server side cursors in the psycopg2 wiki

Answered By: linqu

Here is the code to use for simple server side cursor with the speed of fetchmany management.

The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().

With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.

Answered By: Le Droid

EDIT: using fetchmany (along with fetchone() and fetchall(), even with a row limit (arraysize) will still send the entire resultset, keeping it client-side (stored in the underlying c library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using limit in the sql with an order-by, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)

This is misleading for the library to say the least, and there should be a blurb in the documentation about this. I don’t know why it’s not there.


Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.

The fetchmany loop is tedious but I think it’s the best solution here. To make life easier, you can use the following:

from functools import partial
from itertools import chain

# from_iterable added >= python 2.7
from_iterable = chain.from_iterable

# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
    if parms is None:
        curs.execute(sql)
    else:
        curs.execute(sql, parms)
    chunks_until_empty = iter(partial(fetchmany, chunksize), [])
    return from_iterable(chunks_until_empty)

# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
    print 'lots of waffles: %s' % (row,)
Answered By: parity3

As I was reading comments and answers I thought I should clarify something about fetchone and Server-side cursors for future readers.

With normal cursors (client-side), Psycopg fetches all the records returned by the backend, transferring them to the client process. The whole records are in the client’s memory. It is when you execute a query like curs.execute('SELECT * FROM ...'.

All the fetch* methods are there for accessing this stored data.

Q: So how fetchone can help us memory wise ?
A: It fetches only one record from the stored data and creates a single Python object and hands you in your Python code while fetchall will fetch and create n Python objects from this data and hands it to you all in one chunk.

So If your table has 1,000,000 records, this is what’s going on in memory:

curs.execute --> whole 1,000,000 result set  +  fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set  +  fetchall --> 1,000,000 Python objects

Of-course fetchone helped but still we have the whole records in memory. This is where Server-side cursors comes into play:

PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.

their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).

So you won’t get the whole result set in one chunk.

The draw-back :

The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.

Answered By: S.B