Python: does cursor.execute() load all the data?
Question:
I am trying to query a large dataset (10 million rows) while preventing an out-of-memory error, but I'm not familiar with Python and am confused by the different opinions regarding execute(), the cursor iterator, and fetchone().
Am I right to assume that cursor.execute() does not load all the data into memory, and that only when I call fetchone() will it load one row of data?
from mysql.connector import MySQLConnection

def query():
    conn = MySQLConnection(host=conf['host'],
                           port=conf['port'],
                           user=conf['user'],
                           password=conf['password'],
                           database=conf['database'])
    cursor = conn.cursor(buffered=True)
    cursor.execute('SELECT * FROM TABLE')  # 10 million rows
Does the cursor iterator do the same thing as fetchone()?
    for row in cursor:
        print(row)
Is my code snippet safe for handling 10 million rows of data? If not, how can I safely iterate over the data without running out of memory?
Answers:
Taken from the MySQL documentation:

The fetchone() method is used by fetchall() and fetchmany(). It is also used when a cursor is used as an iterator.

The following example shows two equivalent ways to process a query result. The first uses fetchone() in a while loop, the second uses the cursor as an iterator:
# Using a while loop
cursor.execute("SELECT * FROM employees")
row = cursor.fetchone()
while row is not None:
    print(row)
    row = cursor.fetchone()

# Using the cursor as iterator
cursor.execute("SELECT * FROM employees")
for row in cursor:
    print(row)
It also states that:
You must fetch all rows for the current query before executing new statements using the same connection.
If you are worried about memory usage, you should use fetchmany(n) in a while loop until you have fetched all of the results, like so:
def iter_rows(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
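To see the fetchmany() batching pattern run end to end, here is a self-contained sketch. It uses sqlite3 purely as a stand-in for MySQL (both expose the PEP 249 fetchmany() interface, so the pattern is identical); the table and data are made up for illustration:

```python
import sqlite3

# sqlite3 stands in for MySQL here: both follow PEP 249, so
# fetchmany() batching works the same way on either cursor.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO employees VALUES (?, ?)",
                [(i, f"emp{i}") for i in range(10)])

cur.execute("SELECT * FROM employees")
arraysize = 3  # batch size; tune to trade memory for round trips
fetched = []
while True:
    results = cur.fetchmany(arraysize)  # at most 3 rows per call
    if not results:
        break
    fetched.extend(results)
print(len(fetched))  # 10
```

With a real MySQL cursor you would keep arraysize in the thousands so each network round trip fetches a useful batch while memory stays bounded.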
This behavior adheres to PEP 249, which describes how and which methods DB connectors should implement. A partial answer is given in this thread.
Basically, the implementation of fetchall vs. fetchmany vs. fetchone is up to the developers of the library, depending on the database's capabilities, but it would make sense, in the case of fetchmany and fetchone, for the unfetched/remaining results to be kept server side until they are requested by another call or the cursor object is destroyed.
So in conclusion, I think it is safe to assume that calling the execute method does not, in this case (mysqldb), dump all the data from the query into memory.
My first suggestion is to use from mysql.connector import connect, which by default uses the C extension (CMySQLConnection), instead of from mysql.connector import MySQLConnection (pure Python).
If for some reason you want to use the pure Python version, you can pass use_pure=True to connect().
The second suggestion is to paginate the results: if you use a buffered cursor, it fetches the entire result set from the server, and you probably don't want that with 10M rows.
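One common way to paginate is keyset pagination: repeatedly select the next batch with WHERE id > last_seen ORDER BY id LIMIT n, which avoids the growing cost of large OFFSET values. The sketch below is hedged: the employees table, id and name columns, and batch size are all illustrative, and sqlite3 is used only so the example runs without a MySQL server (the SQL is the same):

```python
import sqlite3

def paginate(conn, batch_size=1000):
    """Keyset pagination: yield rows in id order, batch_size at a time.

    Assumes an indexed, monotonically increasing `id` column; the
    table and column names here are illustrative, not from the question.
    """
    last_id = 0
    while True:
        cur = conn.execute(
            "SELECT id, name FROM employees WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size))
        rows = cur.fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1][0]  # resume after the last id seen

# Demo: 25 rows paginated in batches of 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(i, f"emp{i}") for i in range(1, 26)])
total = sum(1 for _ in paginate(conn, batch_size=10))
print(total)  # 25
```

Because each query touches only batch_size rows, memory stays flat no matter how large the table is, at the cost of one query per batch.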
Here are some references:
https://dev.mysql.com/doc/refman/8.0/en/limit-optimization.html
https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursorbuffered.html