Django Table with Million of rows

Question:

I have a project with 2 applications ( books and reader ).

Books application has a table with 4 milions of rows with this fields:

 book_title = models.CharField(max_length=40)
 book_description = models.CharField(max_length=400)

To avoid to query the database with 4 milions of rows, I am thinking to divide it by subject ( 20 models with 20 tables with 200.000 rows ( book_horror, book_drammatic, ecc ).

In "reader" application, I am thinking to insert this fields:

reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()

So instead of ForeignKey, I am thinking to use a integer "book_subject" (which allows to access the appropriate table) and "book_id" (which allows to access the book in the table specified in "book_subject").

Is a good solution to avoid to query a table with 4 milions of rows ?

Is there an alternative solution?

Asked By: xRobot

||

Answers:

ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.

Edit:
And for pete’s sake, keep it in one table and use indexes as appropriate.

I’m not familiar with Django, but I have a general understanding of DB.

When you have large databases, it’s pretty normal to index your database. That way, retrieving data, should be pretty quick.

When it comes to associate a book with a reader, you should create another table, that links reader to books.

It’s not a bad idea to divide the books into subjects. But I’m not sure what you mean by having 20 applications.

Answered By: Steven

Are you having performance problems? If so, you might need to add a few indexes.

One way to get an idea where an index would help is by looking at your db server’s query log (instructions here if you’re on MySQL).

If you’re not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.

Answered By: Seth

A common approach to this type of problem is Sharding. Unfortunately it’s mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I’m not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.

Perhaps you should look in to caching with something like memcached. Django supports this quite well.

Answered By: Jim Mitchener

Like many have said, it’s a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.

Indexes are the first step, it sounds like you’ve done this though. 4 million rows should be ok for the db to handle with an index.

Second, check the number of queries you’re running. You can do this with something like the django debug toolbar, and you’ll often be surprised how many unnecessary queries are being made.

Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.

If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (eg multi-db), and you should be able to hand write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don’t use genre to split the tables! pick something that you wont ever, ever change and that you’ll always know when making a query. Like author and divide by first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn’t particularly big — this is why most people here are advising against it!

[edit]

I left out denormalisation! Put common counts, sums etc in the eg author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases or after caching has failed you — but well before sharding or horizontal partitioning.

Answered By: Will Hardy

You haven’t mentioned which database you’re using. Some databases – like MySQL and PostgreSQL – have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.

If you tell us which database you’re using, and what hardware it’s running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.

For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you’ll need to alter shared_buffers and a number of other settings.

Answered By: Dan Fairs

You can use a server-side datatable. If you can implement a server-side datatable, you will be able to have more than 4 million records in less than a second.

Answered By: pujan rai