Multiple write single read SQLite application with Peewee

Question:

I’m using an SQLite database with peewee on multiple machines, and I’m encountering various OperationalError and DatabaseError exceptions. It’s obviously a multithreading problem, but I’m not at all an expert in this, nor in SQL. Here’s my setup and what I’ve tried.

Settings

I’m using peewee to log machine learning experiments. Basically, I have multiple nodes (like, different computers) which run a Python file, and all write to the same base.db file in a shared location. On top of that, I need read access from a single machine, my laptop, to see what’s going on. There are at most ~50 different nodes which instantiate the database and write things to it.

What I’ve tried

At first, I used the SqliteDatabase object:

db = pw.SqliteDatabase(None)

# ... Define tables Experiment and Epoch

def init_db(file_name: str):
    db.init(file_name)
    db.create_tables([Experiment, Epoch], safe=True)
    db.close()

def train():
    xp = Experiment.create(...)

    # Do stuff
    with db.atomic():  
        Epoch.bulk_create(...)
    xp.save()

This worked fine, but I sometimes had jobs which crashed because the database was locked. Then I learnt that SQLite only handles one writer at a time, which caused the problem.

So I turned to SqliteQueueDatabase since, according to the documentation, it’s useful "if you want simple read and write access to a SQLite database from multiple threads." I also added some keyword arguments I found in another thread, which were said to be useful.

The code then looked like this:

db = SqliteQueueDatabase(None, autostart=False, pragmas=[('journal_mode', 'wal')],
                         use_gevent=False,)

def init_db(file_name: str):
    db.init(file_name)
    db.start()
    db.create_tables([Experiment, Epoch], safe=True)
    db.connect()

and the same code for saving, except for the db.atomic part. However, not only do write queries still run into errors, I practically no longer have read access to the database: it is almost always busy.

My question

What is the right object to use in this case? I thought SqliteQueueDatabase was the perfect fit. Are pooled databases a better fit? I’m also asking because I don’t know if I have a good grasp of the threading part: having multiple database objects initialized from multiple machines is different from having a single object on a single machine with multiple threads (like this situation), right? Is there a good way to handle things then?

Sorry if this question is already answered in another place, and thanks for any help! Happy to provide more code if needed of course.

Asked By: gaspardbb


Answers:

Indeed, after @BoarGules’ comment, I realized that I had confused two very different things:

  • Having multiple threads on a single machine: here, SqliteQueueDatabase is a very good fit
  • Having multiple machines, each with one or more threads: that’s basically how the internet works.

So I ended up installing Postgres. A few links that may be useful to people coming after me, on Linux:

  • Install Postgres. You can build it from source if you don’t have root privileges, following Chapter 17 of the official documentation, then Chapter 19.
  • You can export an SQLite database to Postgres with pgloader. But again, if you don’t have the right libraries and don’t want to build everything, you can do it by hand. I did the following; I’m not sure whether a more straightforward solution exists.
  1. Export your tables as CSV (following @coleifer’s comment):
import csv

models = [Experiment, Epoch]
for model in models:
    outfile = '%s.csv' % model._meta.table_name
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        row_iter = model.select().tuples().iterator()
        writer.writerows(row_iter)
  2. Create the tables in the new Postgres database:
db = pw.PostgresqlDatabase('mydb', host='localhost')
# Bind the models to the Postgres database so the tables are created there.
db.bind([Experiment, Epoch])
db.create_tables([Experiment, Epoch], safe=True)
  3. Copy the CSV data into the Postgres database with the following command:
COPY epoch("col1", "col2", ...) FROM '/absolute/path/to/epoch.csv' DELIMITER ',' CSV;

and likewise for the other tables (or script the COPY step from Python, as sketched below).
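
If you have many tables, the COPY step can also be scripted from Python. This is a sketch, assuming psycopg2 as the driver (peewee’s default for Postgres) and that the CSV column order matches the table definition:

# Load each exported CSV through Postgres' COPY, looping over the models.
conn = db.connection()
with conn.cursor() as cursor:
    for model in [Experiment, Epoch]:
        table = model._meta.table_name
        with open('%s.csv' % table, newline='') as f:
            cursor.copy_expert('COPY %s FROM STDIN WITH (FORMAT csv)' % table, f)
conn.commit()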

It worked fine for me, as I had only two tables. It can be annoying if you have more than that; pgloader seems a very good solution in that case, if you can install it easily.
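
Once migrated, each node simply opens its own connection to the same Postgres server instead of writing to a shared file. A sketch; the host, user and password are placeholders, and Experiment and Epoch are the models from the question:

import peewee as pw

# Hypothetical connection settings: every node points at the same server.
db = pw.PostgresqlDatabase('mydb', host='postgres.example.org', port=5432,
                           user='experiments', password='secret')

def init_db():
    # Safe to call from every node: reuse an open connection and only create
    # the tables if they are missing.
    db.connect(reuse_if_open=True)
    db.create_tables([Experiment, Epoch], safe=True)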

Update

At first, I could not create objects from peewee: I got integrity errors. It seemed that the id returned by Postgres (via the RETURNING "epoch"."id" clause) was an id that already existed. From my understanding, this is because the sequence behind the id column is not advanced by the COPY command. Thus, it returned id 1, then 2, and so on, until it reached an id that did not exist yet. To avoid going through all these failed creations, you can directly reset the sequence that feeds the RETURNING clause, with:

ALTER SEQUENCE epoch_id_seq RESTART WITH 10000;

and replace 10000 with the value of SELECT MAX("id") FROM epoch, plus 1.
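
Equivalently, you can let Postgres compute the restart value itself, for instance through peewee’s execute_sql. A sketch: setval makes the next generated id MAX(id) + 1, assuming the table is not empty.

# Point the sequence behind epoch.id at the current maximum, so the next
# INSERT ... RETURNING id yields MAX(id) + 1.
db.execute_sql("SELECT setval('epoch_id_seq', (SELECT MAX(id) FROM epoch))")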

Answered By: gaspardbb

SQLite only supports a single writer at a time, but multiple readers can have the db open (even while a writer is connected) when using WAL mode. For peewee, you can enable WAL mode:

db = SqliteDatabase('/path/to/db', pragmas={'journal_mode': 'wal'})

The other crucial thing, when using multiple writers, is to keep your write transactions as short as possible. Some suggestions can be found here: https://charlesleifer.com/blog/going-fast-with-sqlite-and-python/ under the "Transactions, Concurrency and Autocommit" heading.
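For instance, instead of writing all of an experiment’s epochs in one long transaction, you can insert them in small batches so the write lock is held only briefly. A sketch: save_epochs and chunk_size are made-up names, while db and Epoch are from the question.

def save_epochs(rows, chunk_size=100):
    # Insert in small batches so each write transaction stays short and the
    # write lock is released quickly between batches.
    for i in range(0, len(rows), chunk_size):
        with db.atomic():
            Epoch.bulk_create(rows[i:i + chunk_size])
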

Also note that SqliteQueueDatabase works well for a single process with multiple threads, but will not help you at all if you have multiple processes.

Answered By: coleifer

I think you can just increase the timeout for SQLite and fix your problem.

The issue here is that the default SQLite timeout for writing is low, and when there are even small amounts of concurrent writes, SQLite will start throwing exceptions. This is common and well known.

Set it to something like 5-10 seconds. If you still exceed this timeout, then either increase it or chunk up your writes to the db.

Here is an example. I return a DatabaseProxy here because the proxy allows SQLite to be swapped out for Postgres without changing client code.

import atexit
from peewee import DatabaseProxy  # type: ignore
from playhouse.db_url import connect  # type: ignore
from playhouse.sqlite_ext import SqliteExtDatabase  # type: ignore

DB_TIMEOUT = 5

def create_db(db_path: str) -> DatabaseProxy:
    pragmas = (
        # Negative size is per api spec.
        ("cache_size", -1024 * 64),
        # wal speeds up writes.
        ("journal_mode", "wal"),
        ("foreign_keys", 1),
    )
    sqlite_db = SqliteExtDatabase(
        db_path,
        timeout=DB_TIMEOUT,
        pragmas=pragmas)
    sqlite_db.connect()
    atexit.register(sqlite_db.close)
    db_proxy: DatabaseProxy = DatabaseProxy()
    db_proxy.initialize(sqlite_db)
    return db_proxy
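
For completeness, a minimal usage sketch (the file name, model and field are made up for illustration): create the proxy once at startup, point your models at it, then create the tables.

from peewee import CharField, Model

db = create_db("base.db")  # hypothetical file name

class Experiment(Model):
    # Hypothetical model, standing in for the question's tables.
    name = CharField()

    class Meta:
        database = db  # the DatabaseProxy returned by create_db

db.create_tables([Experiment], safe=True)
Experiment.create(name="run-42")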

Answered By: niteris