How can I get all rows with keys provided in a list using SQLalchemy?

Question:

I have sequence of IDs I want to retrieve. It’s simple:

session.query(Record).filter(Record.id.in_(seq)).all()

Is there a better way to do it?

Asked By: Cheery

||

Answers:

I’d recommend to take a look at the SQL it produces. You can just print str(query) to see it.

I’m not aware of an ideal way of doing it with standard SQL.

Answered By: iny

Your code is absolutety fine.

IN is like a bunch of X=Y joined with OR and is pretty fast in contemporary databases.

However, if your list of IDs is long, you could make the query a bit more efficient by passing a sub-query returning the list of IDs.

Answered By: Adam Dziendziel

If you use composite primary keys, you can use tuple_, as in

from sqlalchemy import tuple_
session.query(Record).filter(tuple_(Record.id1, Record.id2).in_(seq)).all()

Note that this is not available on SQLite (see doc).

Answered By: G.J

There is one other way; If it’s reasonable to expect that the objects in question are already loaded into the session; you’ve accessed them before in the same transaction, you can instead do:

map(session.query(Record).get, seq)

In the case where those objects are already present, this will be much faster, since there won’t be any queries to retrieve those objects; On the other hand, if more than a tiny number of those objects are not loaded, it will be much, much slower, since it will cause a query per missing instance, instead of a single query for all objects.

This can be useful when you are doing joinedload() queries before reaching the above step, so you can be sure that they have been loaded already. In general, you should use the solution in the question by default, and only explore this solution when you have seen that you are querying for the same objects over and over.

The code as is is completely fine. However, someone is asking me for some system of hedging between the two approaches of doing a big IN vs. using get() for individual IDs.

If someone is really trying to avoid the SELECT, then the best way to do that is to set up the objects you need in memory ahead of time. Such as, you’re working on a large table of elements. Break up the work into chunks, such as, order the full set of work by primary key, or by date range, whatever, then load everything for that chunk locally into a cache:

 all_ids = [<huge list of ids>]

 all_ids.sort()
 while all_ids:
     chunk = all_ids[0:1000]

     # bonus exercise!  Throw each chunk into a multiprocessing.pool()!
     all_ids = all_ids[1000:]

     my_cache = dict(
           Session.query(Record.id, Record).filter(
                 Record.id.between(chunk[0], chunk[-1]))
     )

     for id_ in chunk:
         my_obj = my_cache[id_]
         <work on my_obj>

That’s the real world use case.

But to also illustrate some SQLAlchemy API, we can make a function that does the IN for records we don’t have and a local get for those we do. Here is that:

from sqlalchemy import inspect


def get_all(session, cls, seq):
    mapper = inspect(cls)
    lookup = set()
    for ident in seq:
        key = mapper.identity_key_from_primary_key((ident, ))
        if key in session.identity_map:
            yield session.identity_map[key]
        else:
            lookup.add(ident)
    if lookup:
        for obj in session.query(cls).filter(cls.id.in_(lookup)):
            yield obj

Here is a demonstration:

from sqlalchemy import Column, Integer, create_engine, String
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base
import random

Base = declarative_base()


class A(Base):
    __tablename__ = 'a'
    id = Column(Integer, primary_key=True)
    data = Column(String)

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)

ids = range(1, 50)

s = Session(e)
s.add_all([A(id=i, data='a%d' % i) for i in ids])
s.commit()
s.close()

already_loaded = s.query(A).filter(A.id.in_(random.sample(ids, 10))).all()

assert len(s.identity_map) == 10

to_load = set(random.sample(ids, 25))
all_ = list(get_all(s, A, to_load))

assert set(x.id for x in all_) == to_load
Answered By: zzzeek
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.