How to get a row-by-row MySQL ResultSet in Python

Question:

MySQL ResultSets are by default retrieved completely from the server before any work can be done. In cases of huge result sets this becomes unusable. I would like instead to actually retrieve the rows one by one from the server.

In Java, following the instructions here (under “ResultSet”), I create a statement like this:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
              java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

This works nicely in Java. My question is: is there a way to do the same in Python?

One thing I tried is to limit the query to 1000 rows at a time, like this:

start_row = 0
while True:
    cursor = conn.cursor()
    cursor.execute("SELECT item FROM items LIMIT %d,1000" % start_row)
    rows = cursor.fetchall()
    if not rows:
        break
    start_row += 1000
    # Do something with rows...

However, this seems to get slower the higher start_row is.

And no, using fetchone() instead of fetchall() doesn’t change anything.

Clarification:

The naive code I use to reproduce this problem looks like this:

import MySQLdb

conn = MySQLdb.connect(user="user", passwd="password", db="mydb")
cur = conn.cursor()
print "Executing query"
cur.execute("SELECT * FROM bigtable");

print "Starting loop"
row = cur.fetchone()
while row is not None:
    print ", ".join([str(c) for c in row])
    row = cur.fetchone()

cur.close()
conn.close()

On a table of ~700,000 rows this code runs quickly. But on a table of ~9,000,000 rows it prints “Executing query” and then hangs for a very long time. That is why it makes no difference whether I use fetchone() or fetchall(): the time is spent inside execute(), before any row is fetched.

Asked By: itsadok


Answers:

Did you try this version of fetchone? Or something different?

row = cursor.fetchone() 
while row is not None:
    # process
    row = cursor.fetchone()

Also, did you try this?

row = cursor.fetchmany(size=1)
while row:  # fetchmany() returns an empty list, not None, when exhausted
    # process
    row = cursor.fetchmany(size=1)

Not all drivers support these, so you may have gotten errors or found them too slow.


Edit.

When it hangs on execute, you’re waiting for the database. That’s not a row-by-row Python thing; that’s a MySQL thing.

MySQL prefers to fetch all rows as part of its own cache management. In JDBC this is turned off by setting a fetch size of Integer.MIN_VALUE (-2147483648).

The question is, what part of the Python DBAPI becomes the equivalent of the JDBC fetch_size?

I think it might be the arraysize attribute of the cursor. Try

cursor.arraysize = -2**31

And see if that forces MySQL to stream the result set instead of caching it.
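
For reference, the DB-API only defines arraysize as the default number of rows fetchmany() returns, so the portable way to use it is a batched loop like this sketch (process() is a placeholder handler):

cursor.arraysize = 1000            # default batch size for fetchmany()
cursor.execute("SELECT item FROM items")
rows = cursor.fetchmany()          # fetches cursor.arraysize rows
while rows:
    for row in rows:
        process(row)
    rows = cursor.fetchmany()

Whether a driver treats arraysize as a streaming hint is driver-specific; in MySQLdb it only sets the fetchmany() default, and streaming is instead enabled by the SSCursor class (see the answers below).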

Answered By: S.Lott

I think you have to connect passing cursorclass = MySQLdb.cursors.SSCursor:

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(user="user",
                       passwd="password",
                       db="mydb",
                       cursorclass=MySQLdb.cursors.SSCursor)

The default cursor fetches all the data at once, even if you don’t use fetchall.

Edit: SSCursor, or any other cursor class that supports server-side result sets; check the module docs on MySQLdb.cursors.
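
With an SSCursor, the cursor object is itself iterable, so a minimal read loop (a sketch reusing the bigtable query from the question; process() is a hypothetical handler) looks like this:

cur = conn.cursor()
cur.execute("SELECT * FROM bigtable")
for row in cur:   # rows stream from the server as you iterate
    process(row)
cur.close()

One caveat with server-side cursors: you must exhaust or close the result before issuing another query on the same connection.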

Answered By: Rafał Dowgird

The limit/offset solution runs in quadratic time overall, because MySQL has to rescan the skipped rows for every offset. As you suspected, the default cursor stores the entire result set on the client, which may consume a lot of memory.
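
If you do want to paginate, keyset pagination avoids the rescan by seeking on an indexed column instead of using an offset. A sketch, assuming items has an indexed unique id column (not stated in the question):

last_id = 0
while True:
    cursor = conn.cursor()
    cursor.execute("SELECT id, item FROM items WHERE id > %s "
                   "ORDER BY id LIMIT 1000", (last_id,))
    rows = cursor.fetchall()
    if not rows:
        break
    last_id = rows[-1][0]   # resume after the last id seen
    # Do something with rows...

Each batch is then an index seek, so the cost no longer grows with the starting row.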

Instead of paginating, you can use a server-side cursor, which keeps the query running and fetches results as necessary. The cursor class can be customized by supplying a default to the connect call itself, or by supplying a class to the cursor method each time:

from MySQLdb import cursors
cursor = conn.cursor(cursors.SSCursor)

But that’s not the whole story. In addition to storing the MySQL result, the default client-side cursor actually fetches every row regardless. This behavior is undocumented and very unfortunate: it means full Python objects are created for all rows, which consume far more memory than the original MySQL result.

In most cases, a result stored on the client wrapped as an iterator would yield the best speed with reasonable memory usage. But you’ll have to roll your own if you want that.
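
One way to roll that yourself is MySQLdb's low-level _mysql module: store the result client-side in C, then convert rows to Python objects only as you consume them. A sketch, reusing the credentials from the question (process() is a hypothetical handler):

import _mysql   # MySQLdb's thin wrapper around the MySQL C API

conn = _mysql.connect(user="user", passwd="password", db="mydb")
conn.query("SELECT item FROM items")
result = conn.store_result()            # full result buffered client-side, in C
while True:
    rows = result.fetch_row(maxrows=1)  # build Python objects one row at a time
    if not rows:
        break
    process(rows[0])                    # rows are tuples of strings
conn.close()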

Answered By: A. Coady

Try to use MySQLdb.cursors.SSDictCursor

import MySQLdb
import MySQLdb.cursors

con = MySQLdb.connect(host=host,
                      user=user,
                      passwd=pwd,
                      charset=charset,
                      port=port,
                      cursorclass=MySQLdb.cursors.SSDictCursor)
cur = con.cursor()
cur.execute("select f1, f2 from table")
for row in cur:
    print row['f1'], row['f2']

Answered By: hahakubile

I found the best results by mixing elements from some of the other answers.

This included setting cursorclass=MySQLdb.cursors.SSDictCursor (for MySQLdb) or pymysql.cursors.SSDictCursor (for PyMySQL) as part of the connection settings. This lets the server hold the result set (the “SS” stands for server side, as opposed to the default cursor, which brings the results client side) and builds a dictionary out of each row (e.g. {'id': 1, 'name': 'Cookie Monster'}).

Then, to loop through the rows, a while rows is not None test caused an infinite loop in both Python 2.7 and 3.4: even when cur.fetchmany(size=10000) was called and there were no results left, the method returned an empty list ([]) rather than None.

Actual example:

query = """SELECT * FROM my_table"""
conn = pymysql.connect(host=MYSQL_CREDENTIALS['host'], user=MYSQL_CREDENTIALS['user'],
                          passwd=MYSQL_CREDENTIALS['passwd'], charset='utf8', cursorclass = pymysql.cursors.SSDictCursor)
cur = conn.cursor()
results = cur.execute(query)
rows = cur.fetchmany(size=100)
while rows:
    for row in rows: 
        process(row)
    rows = cur.fetchmany(size=100)
cur.close()
conn.close()
Answered By: Garren S