Generate SQL statements from a Pandas Dataframe

Question:

I am loading data from various sources (csv, xls, json etc…) into Pandas dataframes and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?

I know pandas has a to_sql function, but that only works on a database connection; it cannot generate a string.

Example

What I would like is to take a dataframe like so:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

And a function that would generate this (this example is PostgreSQL but any would be fine):

CREATE TABLE data
(
  index timestamp with time zone,
  "A" double precision,
  "B" double precision,
  "C" double precision,
  "D" double precision
)
Asked By: Jorick Spitzen


Answers:

If you only want the ‘CREATE TABLE’ SQL code (and not the insertion of the data), you can use the get_schema function of the pandas.io.sql module:

In [10]: print(pd.io.sql.get_schema(df.reset_index(), 'data'))
CREATE TABLE "data" (
  "index" TIMESTAMP,
  "A" REAL,
  "B" REAL,
  "C" REAL,
  "D" REAL
)

Some notes:

  • I had to use reset_index because it otherwise didn’t include the index
  • If you provide a SQLAlchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names); see the sketch below.
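For example, a minimal sketch of passing an engine so the generated types follow that dialect (the PostgreSQL connection string below is a placeholder and assumes a reachable database):

# Minimal sketch: the connection string is a placeholder, not a working database
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# With con=engine, get_schema emits type names for that dialect
print(pd.io.sql.get_schema(df.reset_index(), 'data', con=engine))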
Answered By: joris

If you want to write the file by yourself, you may also retrieve column names and dtypes and build a dictionary to convert pandas data types to SQL data types.

As an example:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))

tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = list(map(lambda x: x.name, df.dtypes.values))

# Storing column names and dtypes in a dataframe

tableDef = pd.DataFrame(index = range(len(df.columns) + 1), columns=['cols', 'dtypes'])

tableDef.iloc[0]           = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols']   = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes

# Defining a dictionary to convert dtypes

conversion = {'datetime64[ns]':'timestamp with time zone', 'float64':'double precision'}

# Writing sql in a file

f = open('yourdir%s.sql' % tableName, 'w')

f.write('CREATE TABLE %s\n' % tableName)
f.write('(\n')

for i, row in tableDef.iterrows():
    sep = ",\n" if i < tableDef.index[-1] else "\n"
    f.write('\t"%s" %s%s' % (row['cols'], conversion[row['dtypes']], sep))

f.write(')')

f.close()

You can populate your table the same way with INSERT INTO statements, as sketched below.
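For instance, a rough sketch of appending INSERT statements to the same file (the quoting of values here is deliberately naive and my own assumption; adapt it to your SQL dialect):

# Rough sketch: append simple INSERT statements to the same file.
# Values are written with str(), which is only safe for numeric data.
with open('yourdir%s.sql' % tableName, 'a') as f:
    cols = ', '.join('"%s"' % c for c in ['index'] + columnNames)
    for idx, row in df.iterrows():
        vals = ', '.join(["'%s'" % idx] + [str(v) for v in row.values])
        f.write('INSERT INTO %s (%s) VALUES (%s);\n' % (tableName, cols, vals))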

Answered By: Delforge

GENERATE SQL CREATE STATEMENT FROM DATAFRAME

SOURCE = df
TARGET = data

def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):

# SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
# SOURCE: source dataframe
# TARGET: target table to be created in database

    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)   
    return sql_text

Check the SQL CREATE TABLE Statement String

sql_text = SQL_CREATE_STATEMENT_FROM_DATAFRAME(df, 'data')
print(sql_text)

GENERATE SQL INSERT STATEMENT FROM DATAFRAME

def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():       
        sql_texts.append('INSERT INTO '+TARGET+' ('+ str(', '.join(SOURCE.columns))+ ') VALUES '+ str(tuple(row.values)))        
    return sql_texts

Check the SQL INSERT INTO Statement String

sql_texts = SQL_INSERT_STATEMENT_FROM_DATAFRAME(df, 'data')
print('\n\n'.join(sql_texts))
Answered By: Jansen Simanullang

Taking user @joris’s post to get the CREATE statement, I extended it further to work for any CSV:

import sqlite3
import pandas as pd

db = './database.db'
csv = './data.csv'
table_name = 'data'

# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()


# now we can insert data
def insert_data(row, c):
    values = str(row.name)+','+','.join([str('"'+str(v)+'"') for v in row])
    sql_insert=f"INSERT INTO {table_name} VALUES ({values})"

    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} n failed with Error:{e}")



# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)

# finally commit all those inserts into the database
conn.commit()

Hopefully this is simpler than the alternative answers and more Pythonic!

Answered By: robmsmt

Insert Statement Solution

Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of NaN values with the help of regular expressions.

import re

def get_insert_query_from_df(df, dest_table):

    insert = """
    INSERT INTO `{dest_table}` (
        """.format(dest_table=dest_table)

    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n        ', columns_string)
    columns_string = re.sub(r"'", '', columns_string)

    values_string = ''

    for row in df.itertuples(index=False,name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'

    return insert + columns_string + ')\n     VALUES\n' + values_string[:-2] + ';'
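A quick usage sketch (the DataFrame and table name below are made up purely for illustration):

import numpy as np
import pandas as pd

# Made-up example data, just to show the shape of the generated query
sample = pd.DataFrame({'id': [1, 2], 'score': [3.5, np.nan]})
print(get_insert_query_from_df(sample, 'my_table'))
# Prints something like:
#     INSERT INTO `my_table` (
#         id,
#         score)
#      VALUES
# (1, 3.5),
# (2, null);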
Answered By: hunterm

SINGLE INSERT QUERY SOLUTION

I didn’t find the above answers to suit my needs. I wanted to create one single insert statement for a dataframe with each row as the values. This can be achieved by the below:

import re 
import pandas as pd 

table = 'your_table_name_here'

# You can read from CSV file here... just using read_sql_query as an example

df = pd.read_sql_query(f'select * from {table}', con=db_connection)


cols = ', '.join(df.columns.to_list()) 
vals = []

for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")

    row_str = ', '.join(row)
    vals.append(row_str)

f_values = [] 
for v in vals:
    f_values.append(f'({v})')

# Handle inputting NULL values
f_values = ', '.join(f_values) 
f_values = re.sub(r"('None')", "NULL", f_values)

sql = f"insert into {table} ({cols}) values {f_values};" 

print(sql)

db.dispose()
Answered By: rup

If you can forgo generating an intermediate representation of the SQL statement, you can also just execute the insert statement directly.

con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?, ?)", list(df_.values))

This worked a little better as there is less messing around with string generation.
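For context, a minimal self-contained sketch with sqlite3 (the in-memory database, table definition, and frame below are assumptions for illustration):

import sqlite3

import numpy as np
import pandas as pd

# Throwaway example frame and in-memory database, purely for illustration
df_ = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE data ("A" REAL, "B" REAL, "C" REAL, "D" REAL)')

# One placeholder per column; executemany binds every row tuple in the list
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)",
                list(df_.itertuples(index=False, name=None)))
con.commit()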

Answered By: 0x00

If you’re just looking to generate a string with inserts based on a pandas.DataFrame, I’d suggest using the bulk SQL insert syntax, as suggested by @rup.

Here’s an example of a function I wrote for that purpose:

import pandas as pd
import re


def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"

By the way, it converts nan/None entries to NULL and it’s possible to pass constant column=value pairs as keyword arguments (see status="APPROVED" and address=None arguments in docstring example).

Generally, it works faster, since any database does a lot of work for a single insert: checking constraints, building indices, flushing, writing to the log, etc. These operations can be optimized by the database when they arrive as one multi-row statement rather than as row-by-row calls to the engine.

Answered By: absoup

The solution I used was to send the dataframe to a DB in memory, using SQLite3.

After that, I dump the DB, writing the statements to a .sql file.

… just for the demo, I load an example dataset:

from datetime import datetime

import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine


# Load Dataset
dataset_name = 'iris'
df = sns.load_dataset(dataset_name)

# Add Name to Index
df.index.name = 'Id'

# Results
df.head()

We create an engine using SQLAlchemy. This connection will be used by pandas to send the data to the in-memory database, and also by SQLite3 to dump the contents of the database.

# Create Engine with SQL Alchemy (used by pandas)
engine = create_engine('sqlite://', echo=False)

# Send data to temporary SQLite3
df.to_sql(name=dataset_name, index=True, con=engine, if_exists='replace')

Finally, we indicate the path to the output file and do the iterdump.

# Output file
output_file = f'sql - {dataset_name}.sql'

# For each line of the dump
with open(output_file, 'w') as f:
    # Date
    data_agora = datetime.today().strftime('%Y.%m.%d %H:%M:%S')
    
    f.write(
        '/****** Query to create and insert records into the DB ******/\n'
    )
    f.write('/*\n')
    f.write(f'There are {len(df)} records\n')
    f.write(f'Taken from the table "{dataset_name}"\n')
    f.write('\n')
    f.write(f'Query written by Michel Metran on {data_agora},\n')
    f.write('*/\n')
    f.write('\r\n')
    
    with engine.connect() as conn:
        for line in conn.connection.iterdump():            
            f.write(f'{line}\n')
            print(line)
    
    # Close Connection
    conn.close()

To make life easier, I created a function inside a package that I maintain, called "traquitanas". You only need to install the package and use the function:

#!pip3 install traquitanas --upgrade
from traquitanas.data import convert_to_sql

convert_to_sql.convert_dataframe_to_sql(df, output_file, dataset_name)
Answered By: Michel Metran