Indexing a SearchVector vs Having a SearchVectorField in Django. When should I use which?

Question:

Clearly I have some misunderstandings about the topic. I would appreciate it if you corrected my mistakes.

As explained in the PostgreSQL documentation, we need to use full-text search instead of simple textual search operators.

Suppose I have a blog application in Django.

Entry.objects.filter(body_text__search="Cheese")

The bottom line is that we have "documents", which are the individual records in the body_text field, and a search term, "Cheese".

Individual documents are translated into something called a "tsvector" (a vector of simplified words), and a "tsquery" is created from our term.

  1. If I have no SearchVectorField field and no SearchVector index:

    For every single record, a tsvector is built from body_text and checked against our tsquery; if it doesn’t match, we continue to the next record.

  2. If I have SearchVectorField field but not SearchVector index:

    The tsvector is stored in the SearchVectorField, so searching is faster because we only check for a match and no longer build the tsvector. We are still checking every single record one by one, though.

  3. If I have both SearchVectorField field and SearchVector index:

    A GIN index is created in the database. It is somewhat like a dictionary: "cat": [3, 7, 18], .... It stores the occurrences of the "lexemes" (normalized words) so that we don’t have to iterate through all the records in the database. I think this is the fastest option.

  4. Now if I have only SearchVector index:

    we have all the benefits of number 3.

Then why should I have a SearchVectorField in my table? In other words, why do I need to store the tsvector if I already have it indexed?

Django documentation says:

If this approach becomes too slow, you can add a SearchVectorField to your model.

Thanks in advance.

Asked By: S.B

||

Answers:

Use no SearchVectorField and no SearchVector index when:

  • Your dataset is small.
  • The search operation is not performed frequently.
  • Computational resources are not a constraint.

Use a SearchVectorField without a SearchVector index when:

  • Your dataset is moderate-sized.
  • The search operation is not a frequent bottleneck.
  • Precomputing the tsvector improves search performance.

Use a SearchVectorField with a SearchVector index when:

  • Your dataset is large.
  • The search operation needs to be performed frequently.
  • Optimal search performance is crucial.
  • Storing the tsvector in a field and utilizing a GIN index provides the best performance.

Use only a SearchVector index when:

  • You have limited storage space or don’t need to access the tsvector values directly.
  • Good search performance is still desired.
  • Storing the tsvector in a field is not necessary.

Edit:

Yes, your statements were correct in describing the different scenarios and their implications.

In the case of your question, from a speed perspective, having just a SearchVector index can indeed be sufficient in terms of search performance. The SearchVector index allows for efficient searching by leveraging the index structure, which speeds up the search process by avoiding the need to iterate through all the records.

In this case, having a SearchVectorField is not strictly necessary for achieving good search performance. The primary benefit of the SearchVectorField is that it allows you to store the tsvector values directly in the database, which can be useful if you need direct access to those values for other purposes. However, if you solely care about search speed and don’t need direct access to the tsvector values, having only the SearchVector index is sufficient.
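To picture why an index avoids iterating through all the records, here is a toy model in plain Python. It is only an illustration: to_vector is a crude stand-in for to_tsvector (no stemming, no stop words), the documents are made up, and a real GIN index is far more sophisticated than a dictionary.

```python
import re
from collections import defaultdict

# Hypothetical rows, keyed by primary key.
DOCS = {
    1: "Cheese on toast tonight",
    2: "A guide to sourdough bread",
    3: "Why cheese and bread go together",
}


def to_vector(text):
    """Crude stand-in for to_tsvector: the set of lowercased words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def scan_search(term):
    """No index: every row is visited and checked against the term."""
    return [pk for pk, text in DOCS.items() if term.lower() in to_vector(text)]


def build_index(docs):
    """Index: map each word to the rows that contain it."""
    index = defaultdict(list)
    for pk, text in docs.items():
        for word in to_vector(text):
            index[word].append(pk)
    return index


INDEX = build_index(DOCS)


def index_search(term):
    """One dictionary lookup instead of a pass over all rows."""
    return sorted(INDEX.get(term.lower(), []))


print(scan_search("Cheese"))   # [1, 3]
print(index_search("Cheese"))  # [1, 3]
```

Both functions return the same rows; the difference is that scan_search does work proportional to the number of rows, while index_search only touches the entry for the searched word.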

Hope this helps.

Answered By: Jbziscool

I assume an index without a field on the Django side means that it has a functional index. That is fine if your work_mem is large enough to hold the bitmap and you are only doing simple searches like single-word or &&. But if you are doing proximity searches like <->, or if your work_mem is too small, it will need to do "rechecks" on potential matches, and that means the document would need to be parsed again.
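The recheck cost can be pictured with a small plain-Python toy. This is only a sketch under assumptions: parse stands in for to_tsvector, the toy index records which rows contain a word but not where, and the documents are invented. It counts how often a document has to be parsed again during a phrase search, with and without stored vectors.

```python
import re

CALLS = {"parse": 0}  # counts how many times a document is parsed


def parse(text):
    """Stand-in for to_tsvector that keeps word positions."""
    CALLS["parse"] += 1
    positions = {}
    for pos, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        positions.setdefault(word, []).append(pos)
    return positions


DOCS = {
    1: "blue cheese sauce",
    2: "cheese that is blue",
    3: "plain bread",
}

# Toy index: word -> set of rows containing it (no positions kept).
INDEX = {}
for pk, text in DOCS.items():
    for word in parse(text):
        INDEX.setdefault(word, set()).add(pk)

# Stored vectors, playing the role of a SearchVectorField column.
STORED = {pk: parse(text) for pk, text in DOCS.items()}


def adjacent(vector, first, second):
    """True if `second` directly follows `first`, like first <-> second."""
    return any(p + 1 in vector.get(second, []) for p in vector.get(first, []))


def phrase_search(first, second, stored_vectors=None):
    candidates = INDEX.get(first, set()) & INDEX.get(second, set())
    hits = []
    for pk in sorted(candidates):
        # Recheck: the index alone cannot tell whether the words are adjacent.
        vector = stored_vectors[pk] if stored_vectors else parse(DOCS[pk])
        if adjacent(vector, first, second):
            hits.append(pk)
    return hits


CALLS["parse"] = 0
print(phrase_search("blue", "cheese"), CALLS["parse"])          # [1] 2
CALLS["parse"] = 0
print(phrase_search("blue", "cheese", STORED), CALLS["parse"])  # [1] 0
```

In the index-only case every candidate row is parsed again during the recheck; with stored vectors the recheck just reads the precomputed value.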

Answered By: jjanes

I confirm that your statements are correct.

Others have already brought performance-related reasons for choosing a scenario, but there are even more practical reasons.

The fourth scenario, where you just have a functional index calculating the SearchVector, can almost always be used, but it has some limitations.

So to answer your question, you are forced to use a SearchVectorField instead of a simple functional index when your full-text search on a model also needs to include fields from other models (e.g. the name of the author of an article), and also when the functions or operators you want to use in your functional index are not immutable (e.g. date functions).

You can see an example in these two slides from my talk "A Pythonic full-text search" I presented at PyCon US 2023 (https://www.paulox.net/2023/04/23/pycon-us-2023/):

  1. "SearchVector Field" https://speakerdeck.com/pauloxnet/a-pythonic-full-text-search-pycon-us-2022?slide=31
  2. "SearchVector field update" https://speakerdeck.com/pauloxnet/a-pythonic-full-text-search-pycon-us-2022?slide=32

Answered By: Paolo Melchiorre