SQLAlchemy 'bulk_save_objects' vs 'add_all' underlying logic difference?
Question:
Considering the following three methods of using the SQLAlchemy ORM to insert objects:
(1)
for obj in objects:
    session.add(obj)
(2)
session.add_all(objects)
(3)
session.bulk_save_objects(objects)
Suppose objects contains 50000 items:
- Does method (1) form and send 50000 INSERT SQL queries?
- Does method (2) form and send only 1 SQL query?
- Does method (3) form and send only 1 SQL query?
I know these three methods differ a lot in speed, but what are the differences in the underlying implementation details?
Answers:
(2) is basically implemented as (1), and both may emit 50,000 INSERTs during flush if the ORM has to fetch generated values such as primary keys. They may emit even more if those 50,000 objects have relationships that cascade.
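The `Foo` model used in the examples below is not shown in the original; as a minimal sketch (names assumed), it behaves as if defined like this, here against in-memory SQLite rather than PostgreSQL, so the exact SQL logged will differ from the PostgreSQL output shown, and newer SQLAlchemy versions may batch the flush differently:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Foo(Base):
    __tablename__ = "foo"
    # Integer primary key, generated by the database when not supplied
    id = Column(Integer, primary_key=True)

# echo=True logs every statement the flush emits, making the
# per-row vs. batched behaviour visible
engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)

session = Session(engine)
session.add_all([Foo() for _ in range(5)])
# The flush has to fetch each generated id; on older SQLAlchemy
# versions this means one INSERT per object
session.commit()
```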
In [4]: session.add_all([Foo() for _ in range(5)])
In [5]: session.commit()
BEGIN (implicit)
INSERT INTO foo DEFAULT VALUES RETURNING foo.id
{}
... (repeats 3 times)
INSERT INTO foo DEFAULT VALUES RETURNING foo.id
{}
COMMIT
If you provide primary keys and other database-generated values beforehand, the Session can combine the separate inserts into a single “executemany” operation when the arguments match.
In [8]: session.add_all([Foo(id=i) for i in range(5)])
In [9]: session.commit()
BEGIN (implicit)
INSERT INTO foo (id) VALUES (%(id)s)
({'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4})
COMMIT
If your DB-API driver implements executemany() or an equivalent that allows it to issue a single statement carrying multiple rows of data, then it can result in a single query. For example, after enabling executemany_mode='values' the PostgreSQL log contains, for the above:
LOG: statement: INSERT INTO foo (id) VALUES (0),(1),(2),(3),(4)
The bulk operation skips most of the Session machinery, such as persisting related objects, in exchange for performance gains. For example, by default it does not fetch default values such as primary keys, which allows it to batch changes into fewer “executemany” operations where the statement and arguments match.
In [12]: session.bulk_save_objects([Foo() for _ in range(5)])
BEGIN (implicit)
INSERT INTO foo DEFAULT VALUES
({}, {}, {}, {}, {})
In [13]: session.commit()
COMMIT
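Conversely, asking the bulk path to fetch those defaults trades the batching away again: with the `return_defaults=True` flag of `bulk_save_objects`, rows are inserted one at a time so the generated primary keys can be fetched back onto the objects. A sketch against in-memory SQLite, with the same assumed `Foo` model:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Foo(Base):
    __tablename__ = "foo"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)
session = Session(engine)

objects = [Foo() for _ in range(5)]
# return_defaults=True forces row-by-row INSERTs so the generated
# primary keys come back and are set on the passed-in objects
session.bulk_save_objects(objects, return_defaults=True)
session.commit()

print([obj.id for obj in objects])  # ids are now populated
```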
It may still emit multiple statements, again depending on the data and the DB-API driver in use. The documentation is a good read. With psycopg2's fast execution helpers enabled, the above produces the following in the PostgreSQL log:
LOG: statement: INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES;INSERT INTO foo DEFAULT VALUES
In other words, multiple statements have been joined into a “single” statement sent to the server.
So, in the end, the answer to all three is “it depends”, which may of course seem frustrating.
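To make the speed difference concrete, a rough benchmark sketch (in-memory SQLite, assumed model and helper names; absolute numbers will vary by backend, driver, and SQLAlchemy version, but the bulk path should come out clearly ahead):

```python
import time

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Foo(Base):
    __tablename__ = "foo"
    id = Column(Integer, primary_key=True)

def timed(label, insert):
    """Insert 50,000 rows into a fresh in-memory database and time it."""
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        start = time.perf_counter()
        insert(session, [Foo() for _ in range(50_000)])
        session.commit()
        elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return elapsed

t_add = timed("session.add, one by one", lambda s, objs: [s.add(o) for o in objs])
t_add_all = timed("session.add_all", lambda s, objs: s.add_all(objs))
t_bulk = timed("session.bulk_save_objects", lambda s, objs: s.bulk_save_objects(objs))
```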