Querying relational data in a reasonable amount of time

Question:

I have a spreadsheet with about 1.7m lines, totalling 1 GB, and need to perform queries on it.

Being most comfortable with Python, my first approach was to hack together a bunch of dictionaries keyed in a way that would facilitate the queries I was trying to make. E.g., if I needed to access everyone with a particular area code and age, I would make an areacode_age 2-dimensional dictionary. I ended up needing quite a few of these, which pushed the memory footprint up to ~10 GB, and even though I had enough RAM the process was slow.
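
Roughly, these dictionaries looked like the sketch below (simplified, with made-up fields; the real data has many more columns):

from collections import defaultdict

# Each record is a dict of column -> value (fields here are made up).
rows = [
    {"name": "Alice", "area_code": "212", "age": 34},
    {"name": "Bob", "area_code": "415", "age": 34},
]

# 2-dimensional dictionary: areacode_age[area_code][age] -> list of matching rows
areacode_age = defaultdict(lambda: defaultdict(list))
for row in rows:
    areacode_age[row["area_code"]][row["age"]].append(row)

# Constant-time lookup of everyone with a given area code and age
matches = areacode_age["212"][34]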

I then imported sqlite3 and loaded my data into an in-memory database. It turns out that a query like SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f takes about 0.05 seconds. Running it once per row for my 1.7m rows would take roughly 24 hours (1.7 million × 0.05 s ≈ 85,000 s). My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn’t key on date1 and date2, so I was getting every row that matched name and then filtering by date).

Why is this so slow, and how can I make it fast? And what is the Pythonic approach? Possibilities I’ve been considering:

  • sqlite3 is too slow and I need something more heavyweight.
  • I need to change my schema or my queries to be more optimized.
  • I need a new tool of some kind.
  • I read somewhere that in sqlite3, doing repeated calls to cursor.execute is much slower than using cursor.executemany. It turns out executemany isn’t compatible with SELECT statements though, so I think this was a red herring (a small example of what executemany is actually for follows this list).
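
For reference, a minimal example of what executemany does seem to be for: binding one statement against many parameter sets (e.g. bulk INSERTs), which is why it can't return rows for a SELECT:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (name TEXT, date1 TEXT, date2 TEXT)")

# executemany runs the same statement once per parameter tuple; it is
# meant for bulk writes and cannot be used to fetch results.
people = [
    ("alice", "2012-01-01", "2012-02-01"),
    ("bob", "2012-01-15", "2012-03-01"),
]
conn.executemany("INSERT INTO foo (name, date1, date2) VALUES (?, ?, ?)", people)
conn.commit()
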
Asked By: Coquelicot


Answers:

It turns out though, that doing a query like “SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f” takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours of compute time. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn’t key on date1 and date2 obviously, so I was getting every row that matched name and then filtering by date).

Did you actually try this and observe that it was taking 24 hours? Processing time is not necessarily directly proportional to data size.

And are you suggesting that you might need to run SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f 1.7 million times? You only need to run it once, and it will return the entire subset of rows matching your query.

1.7 million rows is not small, but certainly not an issue for a database entirely in memory on your local computer. (No slow disk access; no slow network access.)


The proof is in the pudding. This is pretty fast for me (most of the time is spent generating the ~10 million random floats):

import random
import sqlite3

# Build an in-memory table with 1.7 million rows of six random floats each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (a FLOAT, b FLOAT, c FLOAT, d FLOAT, e FLOAT, f FLOAT)")
for _ in range(1700000):
    data = [random.random() for _ in range(6)]
    conn.execute("INSERT INTO numbers VALUES (?,?,?,?,?,?)", data)

conn.commit()

print("done generating random numbers")

# A single SELECT returns every matching row; just iterate over the cursor.
results = conn.execute("SELECT * FROM numbers WHERE a > 0.5 AND b < 0.5")
accumulator = 0
for row in results:
    accumulator += row[0]

print("Sum of column `a` where a > 0.5 and b < 0.5 is %f" % accumulator)

Edit: Okay, so you really do need to run this 1.7 million times.

In that case, what you probably want is an index. To quote Wikipedia:Database Index:

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

You would do something like CREATE INDEX dates_and_name ON foo(date1,date2,name) and then (I believe) execute the rest of your SELECT statements as usual. Try this and see if it speeds things up.
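
For instance, from Python with sqlite3 (the table and column names below are just placeholders matching the query above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a, b, c, date1, date2, name)")

# Compound index covering the WHERE clause, so lookups on
# (date1, date2, name) no longer require a full table scan.
conn.execute("CREATE INDEX dates_and_name ON foo (date1, date2, name)")

# The SELECT itself is unchanged; the bounds are passed as parameters.
cursor = conn.execute(
    "SELECT a, b, c FROM foo WHERE date1 <= ? AND date2 > ? AND name = ?",
    ("2012-06-01", "2012-01-01", "smith"),
)
rows = cursor.fetchall()

Depending on your data, an index that puts the equality column first, e.g. (name, date1), may let SQLite use more of the WHERE clause, so it is worth timing both orderings.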

Answered By: Li-aung Yip

Since you are already talking SQL, the easiest approach will be:

  1. Put all your data into a MySQL table. It should perform well for 1.7 million rows.
  2. Add the indexes you need, check the settings, and make sure queries run fast on a big table.
  3. Access it from Python (one possible way is sketched after this list).
  4. Profit!
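
A minimal sketch of step 3, assuming the PyMySQL driver (pip install pymysql) and placeholder connection details and table name:

import pymysql

# Connection details are placeholders; substitute your own.
conn = pymysql.connect(host="localhost", user="me", password="secret", database="mydb")
try:
    with conn.cursor() as cursor:
        # Parameterized query against the imported table (here called foo).
        cursor.execute(
            "SELECT a, b, c FROM foo WHERE date1 <= %s AND date2 > %s AND name = %s",
            ("2012-06-01", "2012-01-01", "smith"),
        )
        rows = cursor.fetchall()
finally:
    conn.close()
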
Answered By: lithuak

sqlite3 is too slow, and I need something more heavyweight

First, sqlite3 is fast, sometimes faster than MySQL.

Second, you have to use an index; a compound index on (date1, date2, name) will speed things up significantly.
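
One way to confirm the index is actually being used is SQLite's EXPLAIN QUERY PLAN (table, column, and index names below are placeholders):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a, b, c, date1, date2, name)")
conn.execute("CREATE INDEX idx_dates_name ON foo (date1, date2, name)")

# EXPLAIN QUERY PLAN reports whether a query scans the whole table or
# searches it via an index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT a, b, c FROM foo WHERE date1 <= ? AND date2 > ? AND name = ?",
    ("2012-06-01", "2012-01-01", "smith"),
).fetchall()
for row in plan:
    print(row)  # expect something like 'SEARCH foo USING INDEX idx_dates_name ...'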

Answered By: Kien Truong