Caching intermediate table with psycopg2

Question:

Take this block of psycopg2 calls, which involves two SELECTs:

import psycopg2

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT a, b, c FROM table WHERE a > 5 and d < 10;")
        r1 = cur.fetchall()
        cur.execute("SELECT a, b, c FROM table WHERE a > 5 and d > 20;")
        r2 = cur.fetchall()

This is a bit inefficient: the potentially O(N) filter WHERE a > 5 is evaluated twice, when it seems it could be evaluated just once and the two narrower conditions applied to that intermediate result.

What’s the canonical way to do this via the psycopg2 API?

Something like:

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT a, b, c FROM table WHERE a > 5")
        # ...
        cur.execute("SELECT a, b, c FROM temp_table WHERE d < 10;")
        r1 = cur.fetchall()
        cur.execute("SELECT a, b, c FROM temp_table WHERE d > 20;")
        r2 = cur.fetchall()

Is the best solution to use a literal "CREATE TEMP TABLE..."?

I’m coming to this from a Django ORM perspective, where subsequent evaluations of the QuerySet reuse the cached results. Is there anything similar offered by the psycopg2 API?
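
For reference, the Django behaviour I have in mind is roughly the following (MyModel is just a hypothetical stand-in for the table above): the first evaluation runs the query and caches the rows on the QuerySet, and re-iterating it does not hit the database again.

qs = MyModel.objects.filter(a__gt=5)
rows = list(qs)        # query executed here; results cached on qs
rows_again = list(qs)  # served from the QuerySet cache, no second query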

Asked By: Brad Solomon


Answers:

You can execute a single query and split the results into two lists:

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT a, b, c, d FROM my_table WHERE a > 5 and (d < 10 or d > 20);")
        rows = cur.fetchall()
        
# Split the single result set client-side on the value of d.
r1 = [(a, b, c) for a, b, c, d in rows if d < 10]
r2 = [(a, b, c) for a, b, c, d in rows if d > 20]

The above solution should be the most efficient as long as the result set is not enormous. Alternatively, you can create a temporary table:

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TEMP TABLE t AS
            SELECT a, b, c, d 
            FROM my_table 
            WHERE a > 5 and (d < 10 or d > 20);""")
        cur.execute("SELECT a, b, c FROM t WHERE d < 10;")
        r1 = cur.fetchall()
        cur.execute("SELECT a, b, c FROM t WHERE d > 20;")
        r2 = cur.fetchall()        

The temp table is dropped automatically when the connection is closed. Note that exiting the with conn: block only ends the transaction (psycopg2 commits it); the connection, and therefore the temp table, persists until you call conn.close().
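
If you would rather have the temp table dropped as soon as the transaction commits (i.e. when the with conn: block exits) instead of lingering for the rest of the session, PostgreSQL's ON COMMIT DROP clause can be added when creating it. A minimal sketch of that variant:

with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor() as cur:
        # ON COMMIT DROP ties the table's lifetime to the current transaction,
        # which psycopg2 commits when the "with conn:" block exits cleanly.
        cur.execute("""
            CREATE TEMP TABLE t ON COMMIT DROP AS
            SELECT a, b, c, d
            FROM my_table
            WHERE a > 5 and (d < 10 or d > 20);""")
        cur.execute("SELECT a, b, c FROM t WHERE d < 10;")
        r1 = cur.fetchall()
        cur.execute("SELECT a, b, c FROM t WHERE d > 20;")
        r2 = cur.fetchall()

Both SELECTs run inside the same transaction, so they can still see t before it is dropped at commit.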

If the result set is too large to be handled practically on the client side, use a server-side cursor (a named cursor in psycopg2). When you iterate over such a cursor, the rows are actually retrieved from the server in batches; you can set the batch size via the cursor's itersize attribute.

r1 = []
r2 = []
with psycopg2.connect("dbname=test user=postgres") as conn:
    with conn.cursor('my_cursor') as cur:  # naming the cursor makes it server-side
        cur.itersize = 1000  # rows fetched from the server per network round trip
        cur.execute("SELECT a, b, c, d FROM my_table WHERE a > 5 and (d < 10 or d > 20);")
        for row in cur:
            if row[3] < 10:
                r1.append((row[0], row[1], row[2]))
            else:
                r2.append((row[0], row[1], row[2]))

Answered By: klin