Why would I get a memory error with fast_executemany on a tiny df?

Question:

I was looking for ways to speed up pushing a dataframe to SQL Server and stumbled upon an approach here. This approach blew me away in terms of speed: pushing a 100k row x 100 column df with normal to_sql took almost 2 hours, while this script was done in 12.54 seconds.

So after testing the code below with a sample df, I attempted to use a df that had many different datatypes (int, string, float, Boolean). However, I was sad to see a memory error. So I started reducing the size of my df to see what the limitations were. I noticed that if my df had any strings then I wasn't able to load to SQL Server. I am having trouble isolating the issue further. The script below is taken from the question in the link; however, I added a tiny df with strings. Any suggestions on how to rectify this issue would be great!

import pandas as pd
import numpy as np
import time
from sqlalchemy import create_engine, event
from urllib.parse import quote_plus
import pyodbc

conn =  "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
new_con = 'mssql+pyodbc:///?odbc_connect={}'.format(quoted)
engine = create_engine(new_con)


@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    # enable pyodbc's fast_executemany for bulk INSERTs
    print("FUNC call")
    if executemany:
        cursor.fast_executemany = True


table_name = 'fast_executemany_test'
df1 = pd.DataFrame({'col1':['tyrefdg','ertyreg','efdgfdg'],
                   'col2':['tydfggfdgrefdg','erdfgfdgfdgfdgtyreg','edfgfdgdfgdffdgfdg']
                   })



s = time.time()
df1.to_sql(table_name, engine, if_exists = 'replace', chunksize = None)
print(time.time() - s)
Asked By: CandleWax


Answers:

I was able to reproduce your issue using pyodbc 4.0.23. The MemoryError was related to your use of the ancient

DRIVER={SQL Server}

Further testing using

DRIVER=ODBC Driver 11 for SQL Server

also failed, with

Function sequence error (0) (SQLParamData)

which was related to an existing pyodbc issue on GitHub. I posted my findings here.

That issue is still under investigation. In the meantime you might be able to proceed by

  • using a newer ODBC driver like DRIVER=ODBC Driver 13 for SQL Server, and
  • running pip install pyodbc==4.0.22 to use an earlier version of pyodbc, as sketched below.
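
For example, with the same SERVER_IP / DB_NAME placeholders as the question (a minimal sketch, not a tested configuration), the two workarounds together would look something like this:

# pip install pyodbc==4.0.22   (pin the earlier pyodbc release)
from urllib.parse import quote_plus
from sqlalchemy import create_engine

# swap the ancient {SQL Server} driver for a newer one
conn = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
quoted = quote_plus(conn)
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
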
Answered By: Gord Thompson

I experienced this issue with 32-bit Python and switched my interpreter to 64-bit, which resolved my memory issue. Outside of that solution, I would recommend chunking the amount of data you process: establish a threshold, and once that threshold is met, process that chunk of data and iterate until you have processed all of the data.
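
A rough sketch of that chunked approach, assuming the df1, table_name and engine from the question and an arbitrary 10,000-row threshold:

import numpy as np

chunk_rows = 10000  # arbitrary threshold; tune to what your memory allows
n_chunks = max(1, -(-len(df1) // chunk_rows))  # ceiling division

for i, chunk in enumerate(np.array_split(df1, n_chunks)):
    # replace the table with the first chunk, then append the rest
    chunk.to_sql(table_name, engine,
                 if_exists='replace' if i == 0 else 'append',
                 index=False)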

Answered By: Ian Maurer

I was facing this issue too.

Solved by changing:

conn =  "DRIVER={SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"

To this:

conn =  "DRIVER={ODBC Driver 17 for SQL Server};SERVER=SERVER_IP;DATABASE=DB_NAME;UID=USER_ID;PWD=PWD"
Answered By: Hugo Vares

I have bumped into this exact issue a bunch of times, and I’m not seeing the answer that always solves it for me, so here we go. It turns out that passing the variable types is very important when using fast_executemany=True.

In particular, you will run into a memory error with string fields if you do not pass a maximum number of characters in your type to to_sql. What seems to be going on is that if you don't include a maximum number of characters that can be entered into the string field, sqlalchemy will devote as much memory as it can to each row in the column. So if you have a sizeable amount of data you quickly run out of memory on your machine. The fix is to pass a dictionary to the dtype parameter.

Here is a snippet to give you an idea of what I'm talking about:

import sqlalchemy
from sqlalchemy import create_engine

conn = 'mssql+pyodbc://server/db?driver=SQL+SERVER'
engine = create_engine(conn, fast_executemany=True)

# cap the width of each string column so sqlalchemy doesn't reserve
# the maximum possible memory for every row
typing = {
          'col1': sqlalchemy.types.String(5),
          'col2': sqlalchemy.types.String(5),
          }
df.to_sql('Table', engine, dtype=typing)

I’ve also found that for large database entries passing a chunksize parameter is helpful, because otherwise my server times out, but that is a problem for another answer, on another day.
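
For example, combining the dtype mapping above with a chunksize might look something like this (the 1,000 rows here is just a guess, not a benchmarked value):

# write 1,000 rows per round trip so one huge INSERT doesn't time out
df.to_sql('Table', engine, dtype=typing, chunksize=1000)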

Answered By: Ryan