Generate SQL statements from a Pandas Dataframe
Question:
I am loading data from various sources (csv, xls, json etc…) into Pandas dataframes and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?
I know pandas has a to_sql function, but that only works on a database connection; it cannot generate a string.
Example
What I would like is to take a dataframe like so:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
And a function that would generate this (this example is PostgreSQL but any would be fine):
CREATE TABLE data
(
index timestamp with time zone,
"A" double precision,
"B" double precision,
"C" double precision,
"D" double precision
)
Answers:
If you only want the 'CREATE TABLE' SQL code (and not the insertion of the data), you can use the get_schema function of the pandas.io.sql module:
In [10]: print(pd.io.sql.get_schema(df.reset_index(), 'data'))
CREATE TABLE "data" (
"index" TIMESTAMP,
"A" REAL,
"B" REAL,
"C" REAL,
"D" REAL
)
Some notes:
- I had to use reset_index because it otherwise didn't include the index
- If you provide an SQLAlchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names); see the sketch below.
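For example, a minimal sketch of the flavor-specific variant (the connection URL is a hypothetical placeholder and assumes a reachable PostgreSQL instance):
import pandas as pd
from sqlalchemy import create_engine
# Hypothetical connection URL; swap in your own credentials.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
# Passing the engine makes get_schema use that dialect's type names
# instead of the generic ones shown above.
print(pd.io.sql.get_schema(df.reset_index(), 'data', con=engine))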
If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert pandas data types to SQL data types.
As an example:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = [dtype.name for dtype in df.dtypes.values]  # a list, so it can be assigned below
# Storing column names and dtypes in a dataframe
tableDef = pd.DataFrame(index = range(len(df.columns) + 1), columns=['cols', 'dtypes'])
tableDef.iloc[0] = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols'] = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes
# Defining a dictionary to convert dtypes
conversion = {'datetime64[ns]':'timestamp with time zone', 'float64':'double precision'}
# Writing the SQL to a file
with open('yourdir%s.sql' % tableName, 'w') as f:
    f.write('CREATE TABLE %s\n' % tableName)
    f.write('(\n')
    for i, row in tableDef.iterrows():
        sep = ",\n" if i < tableDef.index[-1] else "\n"
        f.write('\t"%s" %s%s' % (row['cols'], conversion[row['dtypes']], sep))
    f.write(')')
You can proceed the same way to populate your table with INSERT INTO statements; a naive sketch follows.
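For instance, a minimal sketch in the same style, reusing df, tableName and tableDef from above (plain string formatting with no escaping, so it only suits the float/timestamp example; real data would want parameterized queries):
# Append naive INSERT statements to the same file.
with open('yourdir%s.sql' % tableName, 'a') as f:
    cols = ', '.join('"%s"' % c for c in tableDef['cols'])
    for idx, row in df.iterrows():
        # Quote the timestamp index; the floats can go in bare.
        vals = ', '.join(["'%s'" % idx] + [str(v) for v in row])
        f.write('INSERT INTO %s (%s) VALUES (%s);\n' % (tableName, cols, vals))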
GENERATE SQL CREATE STATEMENT FROM DATAFRAME
SOURCE = df
TARGET = data
def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
    # SOURCE: source dataframe
    # TARGET: target table to be created in database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text
Check the SQL CREATE TABLE statement string:
sql_text = SQL_CREATE_STATEMENT_FROM_DATAFRAME(df, 'data')
print(sql_text)
GENERATE SQL INSERT STATEMENT FROM DATAFRAME
def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():
        sql_texts.append('INSERT INTO ' + TARGET + ' (' + ', '.join(SOURCE.columns) + ') VALUES ' + str(tuple(row.values)))
    return sql_texts
Check the SQL INSERT INTO statement strings:
sql_texts = SQL_INSERT_STATEMENT_FROM_DATAFRAME(df, 'data')
print('\n\n'.join(sql_texts))
Taking user @Jaris's post to get the CREATE statement, I extended it further to work for any CSV:
import sqlite3
import pandas as pd
db = './database.db'
csv = './data.csv'
table_name = 'data'
# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()
# now we can insert data
def insert_data(row, c):
    values = str(row.name) + ',' + ','.join('"' + str(v) + '"' for v in row)
    sql_insert = f"INSERT INTO {table_name} VALUES ({values})"
    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL: {sql_insert}\nfailed with error: {e}")
# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)
# finally commit all those inserts into the database
conn.commit()
Hopefully this is simpler than the alternative answers and more Pythonic!
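One caveat, sketched below: interpolating values into the SQL string breaks on quotes and is open to SQL injection. A hedged variant of insert_data using sqlite3's ? placeholders instead (reusing df, c, conn and table_name from above; insert_data_safe is an invented name):
def insert_data_safe(row, c):
    # One placeholder per value: the index plus each column.
    placeholders = ', '.join('?' * (len(row) + 1))
    # Convert numpy scalars to plain Python types for the driver.
    values = [str(row.name)] + [v.item() if hasattr(v, 'item') else v for v in row]
    c.execute(f"INSERT INTO {table_name} VALUES ({placeholders})", values)

df.apply(lambda row: insert_data_safe(row, c), axis=1)
conn.commit()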
Insert Statement Solution
Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of nan values with the help of regular expressions.
import re
def get_insert_query_from_df(df, dest_table):
    insert = """
    INSERT INTO `{dest_table}` (
        """.format(dest_table=dest_table)
    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n        ', columns_string)
    columns_string = re.sub(r"'", '', columns_string)
    values_string = ''
    for row in df.itertuples(index=False, name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'
    return insert + columns_string + ')\n VALUES\n' + values_string[:-2] + ';'
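A quick usage sketch (the dataframe and table name here are made up):
import numpy as np
import pandas as pd
sample = pd.DataFrame({'a': [1, 2], 'b': [3.5, np.nan]})
print(get_insert_query_from_df(sample, 'my_table'))
# Prints roughly:
#     INSERT INTO `my_table` (
#         a,
#         b)
#  VALUES
# (1, 3.5),
# (2, null);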
SINGLE INSERT QUERY SOLUTION
I didn't find that the above answers suited my needs. I wanted to create a single INSERT statement for a dataframe, with each row as the values. This can be achieved as below:
import re
import pandas as pd
table = 'your_table_name_here'
# You can read from CSV file here... just using read_sql_query as an example
df = pd.read_sql_query(f'select * from {table}', con=db_connection)
cols = ', '.join(df.columns.to_list())
vals = []
for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")
    row_str = ', '.join(row)
    vals.append(row_str)
f_values = []
for v in vals:
    f_values.append(f'({v})')
# Handle inputting NULL values
f_values = ', '.join(f_values)
f_values = re.sub(r"('None')", "NULL", f_values)
sql = f"insert into {table} ({cols}) values {f_values};"
print(sql)
db_connection.dispose()
If you can forgo generating an intermediate representation of the SQL statement, you can also just execute the insert statements outright.
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df_.values))
This worked a little better as there is less messing around with string generation.
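As a minimal self-contained sketch of that approach (the table, columns and in-memory database are invented for illustration):
import sqlite3
import numpy as np
import pandas as pd
df_ = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE data (A REAL, B REAL, C REAL, D REAL)")
# One ? placeholder per column; sqlite3 binds each row tuple for us.
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)",
                df_.itertuples(index=False, name=None))
con.commit()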
If you're just looking to generate a string with inserts based on a pandas.DataFrame, I'd suggest using bulk SQL insert syntax, as suggested by @rup.
Here’s an example of a function I wrote for that purpose:
import pandas as pd
import re
def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"
By the way, it converts nan/None entries to NULL, and it is possible to pass constant column=value pairs as keyword arguments (see the status="APPROVED" and address=None arguments in the docstring example).
Generally, it works faster, since any database does a lot of work for a single insert: checking constraints, building indices, flushing, writing to the log, and so on. These complex operations can be optimized by the database when several rows arrive in one statement, rather than calling the engine row by row.
The solution I used was to send the dataframe to an in-memory database using SQLite3, and after that to dump the database, writing the statements to a .sql file. Just for the demo, I created an example file:
from datetime import datetime
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine
# Load Dataset
dataset_name = 'iris'
df = sns.load_dataset(dataset_name)
# Add Name to Index
df.index.name = 'Id'
# Results
df.head()
We create an engine using SQL Alchemy. This connection will be used by pandas, to send the data to the temporary memory, and also by SQLite3, to dump the contents of the database.
# Create Engine with SQLAlchemy (used by pandas)
engine = create_engine('sqlite://', echo=False)
# Send data to temporary SQLite3
df.to_sql(name=dataset_name, index=True, con=engine, if_exists='replace')
Finally, we indicate the path to the output file and run the iterdump.
# Output file
output_file = f'sql - {dataset_name}.sql'
# For each line of the dump
with open(output_file, 'w') as f:
    # Date
    data_agora = datetime.today().strftime('%Y.%m.%d %H:%M:%S')
    f.write(
        '/****** Query to create and populate the table in the DB ******/\n'
    )
    f.write('/*\n')
    f.write(f'There are {len(df)} records\n')
    f.write(f'obtained from the "{dataset_name}" table\n')
    f.write('\n')
    f.write(f'Query written by Michel Metran on {data_agora}\n')
    f.write('*/\n')
    f.write('\n')
    with engine.connect() as conn:
        for line in conn.connection.iterdump():
            f.write(f'{line}\n')
            print(line)
# Close Connection
conn.close()
To make life easier, I created a function inside a package that I maintain, called "traquitanas"; you just need to install the package and use the function:
#!pip3 install traquitanas --upgrade
from traquitanas.data import convert_to_sql
convert_to_sql.convert_dataframe_to_sql(df, output_file, dataset_name)