SQLAlchemy 'bulk_save_objects' vs 'add_all' underlying logic difference?

Question:

Consider the following three methods of using the SQLAlchemy ORM to insert objects:

(1)

for obj in objects:
    session.add(obj)

(2)

session.add_all(objects)

(3)

session.bulk_save_objects(objects)

Suppose the length of objects is 50,000.

  • Does method (1) form and send 50000 insert SQL queries?
  • Does method (2) form and send only 1 SQL query?
  • Does method (3) form and send only 1 SQL query?

I know these three methods differ a lot in speed, but what are the differences in the underlying implementation details?

Asked By: AnnieFromTaiwan

Answers:

(2) is essentially implemented in terms of (1), and both may emit 50,000 individual INSERT statements during flush if the ORM has to fetch generated values such as primary keys. They may emit even more if those 50,000 objects have relationships that cascade.
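The transcripts below assume a minimal mapped class roughly like the following sketch (the table and column match the logged SQL; the connection URL and the echo flag are assumptions for illustration):

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Foo(Base):
    __tablename__ = "foo"
    # Server-generated primary key, so a plain flush has to fetch it back.
    id = Column(Integer, primary_key=True)

# echo=True prints the emitted SQL, much like the log excerpts below.
engine = create_engine("postgresql://localhost/test", echo=True)
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()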

In [4]: session.add_all([Foo() for _ in range(5)])

In [5]: session.commit()
BEGIN (implicit)
INSERT INTO foo DEFAULT VALUES RETURNING foo.id
{}
... (repeats 3 times)
INSERT INTO foo DEFAULT VALUES RETURNING foo.id
{}
COMMIT

If you provide primary keys and other DB-generated values beforehand, the Session can combine the separate inserts into a single “executemany” operation when the arguments match.

In [8]: session.add_all([Foo(id=i) for i in range(5)])

In [9]: session.commit()
BEGIN (implicit)
INSERT INTO foo (id) VALUES (%(id)s)
({'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4})
COMMIT

If your DB-API driver implements executemany() or an equivalent in a way that allows issuing a single statement for multiple rows of data, this can result in a single query. For example, after enabling executemany_mode='values' the PostgreSQL log for the above contains

LOG: statement: INSERT INTO foo (id) VALUES (0),(1),(2),(3),(4)
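That mode is a flag on the engine configuration of the psycopg2 dialect; a sketch of enabling it, assuming SQLAlchemy 1.3 (the connection URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://localhost/test",
    # Rewrites compatible INSERT statements into a single multi-row
    # VALUES clause using psycopg2's execute_values() helper.
    executemany_mode='values',
)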

The bulk operation skips most of the Session machinery, such as persisting related objects, in exchange for performance gains. For example, by default it does not fetch default values such as primary keys, which allows it to batch changes into fewer “executemany” operations where the statement and arguments match.

In [12]: session.bulk_save_objects([Foo() for _ in range(5)])
BEGIN (implicit)
INSERT INTO foo DEFAULT VALUES
({}, {}, {}, {}, {})

In [13]: session.commit()
COMMIT

It may still emit multiple statements, again depending on the data and on the DB-API driver in use. The documentation is a good read.

With the psycopg2 fast execution helpers enabled, the above produces the following in the PostgreSQL log:

LOG: statement: INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES

In other words, multiple statements have been joined into a “single” statement sent to the server.
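That joining behaviour corresponds to the 'batch' setting of the same flag, which delegates to psycopg2's execute_batch() helper; a sketch under the same SQLAlchemy 1.3 assumption:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://localhost/test",
    # execute_batch() joins statements with semicolons and sends them
    # to the server in far fewer round trips.
    executemany_mode='batch',
)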

So in the end the answer to all three is “it depends”, which of course may seem frustrating.
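To see what “it depends” means on a given setup, a rough timing harness along these lines can compare the three approaches (reusing the Foo model and session sketched above; absolute numbers vary with the driver, the schema, and the executemany_mode setting):

import time

def timed(label, work):
    # Wall-clock time for staging and committing 50,000 objects.
    start = time.perf_counter()
    work()
    session.commit()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

timed("(1) add in a loop",
      lambda: [session.add(Foo()) for _ in range(50000)])
timed("(2) add_all",
      lambda: session.add_all([Foo() for _ in range(50000)]))
timed("(3) bulk_save_objects",
      lambda: session.bulk_save_objects([Foo() for _ in range(50000)]))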

Answered By: Ilja Everilä