Bulk insert huge data into SQLite using Python

Question:

I read this: Importing a CSV file into a sqlite3 database table using Python

and it seems that everyone suggests using line-by-line reading instead of using bulk .import from SQLite. However, that will make the insertion really slow if you have millions of rows of data. Is there any other way to circumvent this?

Update: I tried the following code to insert line by line, but the speed is not as good as I expected. Is there any way to improve it?

import codecs

# allLogFilesName, output (a cursor) and connection are assumed to be
# defined earlier in the script
for logFileName in allLogFilesName:
    logFile = codecs.open(logFileName, 'rb', encoding='utf-8')
    for logLine in logFile:
        logLineAsList = logLine.split('\t')  # tab-separated log fields
        output.execute('''INSERT INTO log VALUES(?, ?, ?, ?)''', logLineAsList)
    logFile.close()
connection.commit()
connection.close()
Asked By: Shar


Answers:

SQLite can do tens of thousands of inserts per second; just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (executemany() does this automatically.)

As always, don’t optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.
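
A minimal sketch of that advice, assuming the four-column log table from the question and an iterable of row tuples called rows (the database path and rows are illustrative); isolation_level=None simply hands transaction control back to us:

import sqlite3

# isolation_level=None disables the module's implicit transaction handling,
# so the BEGIN/COMMIT below mark one explicit transaction for all inserts
connection = sqlite3.connect('log.sqlite', isolation_level=None)
cursor = connection.cursor()

cursor.execute('BEGIN')
for row in rows:  # rows: an iterable of 4-tuples, assumed to exist
    cursor.execute('INSERT INTO log VALUES (?, ?, ?, ?)', row)
cursor.execute('COMMIT')  # the journal is written once, not once per row
connection.close()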

Answered By: davidfg4

Divide your data into chunks on the fly using generator expressions and make the inserts inside a transaction. Here’s a quote from the SQLite optimization FAQ:

Unless already in a transaction, each SQL statement has a new
transaction started for it. This is very expensive, since it requires
reopening, writing to, and closing the journal file for each
statement. This can be avoided by wrapping sequences of SQL statements
with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup
is also obtained for statements which don’t alter the database.

Here’s roughly how your code might look.
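
As a sketch (the chunks helper, the chunk size of 100, and the database path are illustrative; allLogFilesName and the four-column, tab-separated log table come from the question):

import sqlite3

def chunks(iterable, size=100):
    # yield lists of up to `size` items from any iterable
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

connection = sqlite3.connect('log.sqlite')  # illustrative path
cursor = connection.cursor()

for logFileName in allLogFilesName:  # assumed to be defined as in the question
    with open(logFileName, encoding='utf-8') as logFile:
        # all chunks run inside one implicit transaction, committed once below
        for chunk in chunks(logLine.split('\t') for logLine in logFile):
            cursor.executemany('INSERT INTO log VALUES (?, ?, ?, ?)', chunk)

connection.commit()  # a single commit at the end closes the transaction
connection.close()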

Also, SQLite itself can import CSV files through the sqlite3 command-line shell.
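
For example, something along these lines works from the sqlite3 shell (the database, CSV file, and table names are only illustrative):

sqlite3 log.sqlite <<'EOF'
.mode csv
.import logdata.csv log
EOF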

Answered By: alecxe

Since this is the top result on a Google search, I thought it might be nice to update this question.

From the Python sqlite3 docs, you can use executemany():

import sqlite3

persons = [
    ("Hugo", "Boss"),
    ("Calvin", "Klein")
]

con = sqlite3.connect(":memory:")

# Create the table
con.execute("create table person(firstname, lastname)")

# Fill the table
con.executemany("insert into person(firstname, lastname) values (?,?)", persons)

I have used this method to commit over 50k row inserts at a time and it’s lightning fast.

Answered By: Fred

cursor.execute vs cursor.executemany minimal synthetic benchmark

main.py

from pathlib import Path
import sqlite3

f = 'tmp.sqlite'
n = 10000000
Path(f).unlink(missing_ok=True)
connection = sqlite3.connect(f)
#connection.set_trace_callback(print)  # uncomment to log the queries that run
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
# execute variant: one statement per row
#for i in range(n):
#    cursor.execute(f"INSERT INTO t VALUES ({i})")
# executemany variant: a single statement fed by a generator
cursor.executemany("INSERT INTO t VALUES (?)", ((str(i),) for i in range(n)))
connection.commit()
connection.close()

Results according to time main.py:

method       storage    time (s)
executemany  SSD        8.6
executemany  :memory:   9.8
executemany  HDD        10.4
execute      SSD        31
execute      :memory:   29

Baseline time of a bare for i in range(n): pass loop: 0.3 s.

Conclusions:

  • executemany is about 3x faster
  • we are not I/O bound on execute vs executemany; we are likely memory bound. This might not be very surprising given that the final tmp.sqlite file is only 115 MB, and that my SSD can handle 3 GB/s

If I enable connection.set_trace_callback(print) to log queries (as per: How can I log queries in Sqlite3 with Python?) and reduce n to 5, I see the exact same queries for both execute and executemany:

CREATE TABLE t (x integer)
BEGIN 
INSERT INTO t VALUES ('0')
INSERT INTO t VALUES ('1')
INSERT INTO t VALUES ('2')
INSERT INTO t VALUES ('3')
INSERT INTO t VALUES ('4')
COMMIT

So it does not seem that the speed difference is linked to transactions, as both appear to execute within a single transaction. There are some comments on automatic transaction control at https://docs.python.org/3/library/sqlite3.html#transaction-control, but they are not entirely clear; I hope the logs are correct.
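
One way to sanity-check this is the connection.in_transaction attribute; a small sketch with the default isolation_level:

import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
print(connection.in_transaction)  # False: no implicit transaction for DDL
cursor.execute("INSERT INTO t VALUES (1)")
print(connection.in_transaction)  # True: an implicit BEGIN was issued
cursor.executemany("INSERT INTO t VALUES (?)", [(2,), (3,)])
print(connection.in_transaction)  # still True: same open transaction
connection.commit()
print(connection.in_transaction)  # False: COMMIT closed it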

Insertion time baseline

I’ll keep looking for the fastest possible method I can find to compare against Python. So far, this generate_series approach is the winner:

f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(1,10000000)'

It finished in as little as 1.4 s on SSD, therefore substantially faster than any Python method so far, and still far from being SSD-bound.

Tested on Ubuntu 23.04, Python 3.11.2, Lenovo ThinkPad P51

  • SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed
  • HDD: Seagate ST1000LM035-1RK1 1TB, 140 MB/s nominal speed