Indexing a SearchVector vs Having a SearchVectorField in Django. When should I use which?
Question:
Clearly I have some misunderstandings about the topic. I would appreciate it if you corrected my mistakes.
So, as explained in the PostgreSQL documentation, we need to do full-text searching instead of using simple textual search operators.
Suppose I have a blog application in Django.
Entry.objects.filter(body_text__search="Cheese")
The bottom line is that we have "documents", which are the individual records in the blog_post field, and a term, "Cheese".
Individual documents are going to be translated to something called a "tsvector" (a vector of simplified words), and a "tsquery" is created from our term.
1. If I have no SearchVectorField field and no SearchVector index: for every single record, a tsvector is created from the body_text field and checked against our tsquery; if it does not match, we continue to the next record.
2. If I have a SearchVectorField field but no SearchVector index: that tsvector is stored in the SearchVectorField field. So searching is faster because we only check for a match without creating the tsvector anymore, but we are still checking every single record one by one.
3. If I have both a SearchVectorField field and a SearchVector index: a GIN index is created in the database. It is somewhat like a dictionary: "cat": [3, 7, 18], ... It stores the occurrences of the "lexemes" (words) so that we don’t have to iterate through all the records in the database. I think this is the fastest option.
4. Now if I have only a SearchVector index: we have all the benefits of number 3.
Then why should I have a SearchVectorField field in my table? In other words, why do I need to store the tsvector if I already have it indexed?
The Django documentation says:
"If this approach becomes too slow, you can add a SearchVectorField to your model."
Thanks in advance.
Answers:
Use no SearchVectorField and no SearchVector index when:
- Your dataset is small.
- The search operation is not performed frequently.
- Computational resources are not a constraint.
Use a SearchVectorField without a SearchVector index when:
- Your dataset is moderate-sized.
- The search operation is not a frequent bottleneck.
- Precomputing the tsvector improves search performance.
Use a SearchVectorField with a SearchVector index when:
- Your dataset is large.
- The search operation needs to be performed frequently.
- Optimal search performance is crucial.
- Storing the tsvector in a field and utilizing a GIN index provides the best performance.
Use only a SearchVector index when:
- You have limited storage space or don’t need to access the tsvector values directly.
- Good search performance is still desired.
- Storing the tsvector in a field is not necessary.
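For reference, the two indexed setups could be declared roughly as follows. This is only a sketch of a models.py fragment for an app that uses django.contrib.postgres (it is not a standalone script); the model names, the search_vector field name and the index name are made up for illustration, and the "english" config is an assumption.

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models


class EntryWithStoredVector(models.Model):
    """SearchVectorField plus index: the tsvector is stored and GIN-indexed."""

    body_text = models.TextField()
    # Hypothetical field name; Django does not fill this column automatically.
    search_vector = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["search_vector"])]


class EntryWithFunctionalIndex(models.Model):
    """Index only: no stored column, just a GIN expression index."""

    body_text = models.TextField()

    class Meta:
        indexes = [
            GinIndex(
                # An explicit config keeps the indexed expression immutable.
                SearchVector("body_text", config="english"),
                name="entry_body_search_gin",  # hypothetical index name
            ),
        ]
```

With the stored field, the column has to be kept up to date by the application, for example with EntryWithStoredVector.objects.update(search_vector=SearchVector("body_text", config="english")). With the functional index, PostgreSQL can only use the index when the query contains the same expression, including the same config.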
Edit:
Yes, your statements were correct in describing the different scenarios and their implications.
In the case of your question, from a speed perspective, having just a SearchVector index can indeed be sufficient. The index speeds up the search by avoiding the need to iterate through all the records.
In this case, having a SearchVectorField is not strictly necessary for achieving good search performance. The primary benefit of the SearchVectorField is that it stores the tsvector values directly in the database, which can be useful if you need direct access to those values for other purposes. However, if you care solely about search speed and don’t need direct access to the tsvector values, having only the SearchVector index is sufficient.
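To make the "avoid iterating through all the records" point concrete, here is a toy sketch in plain Python. It is not how PostgreSQL implements GIN; to_lexemes is a crude stand-in for to_tsvector, and the documents and ids are invented for illustration.

```python
from collections import defaultdict


def to_lexemes(text):
    """Crude stand-in for to_tsvector: lowercase, strip punctuation, dedupe."""
    words = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return {w for w in words if w}


# Invented sample data: record id -> body text.
documents = {
    3: "The cat sat on the mat.",
    7: "A cat likes cheese.",
    18: "Dogs chase the cat!",
    21: "Cheese and wine.",
}


def sequential_search(term):
    """No index: every record is visited and tested."""
    return sorted(
        doc_id for doc_id, text in documents.items() if term in to_lexemes(text)
    )


def build_inverted_index(docs):
    """GIN-like dictionary: lexeme -> sorted list of record ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for lexeme in to_lexemes(docs[doc_id]):
            index[lexeme].append(doc_id)
    return index


index = build_inverted_index(documents)


def indexed_search(term):
    """With the index: one dictionary lookup, no scan over the records."""
    return index.get(term, [])
```

Both functions return the same ids for a given term; the difference is that indexed_search never touches records that do not contain the lexeme.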
Hope this helps.
I assume an index without a field on the Django side means that it has a functional index. That is fine if your work_mem is large enough to hold the bitmap and you are only doing simple searches like single-word or && queries. But if you are doing proximity searches like <->, or if your work_mem is too small, it will need to do "rechecks" on potential matches, and that means the document would need to be parsed again.
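The recheck cost can be illustrated with a simplified model in plain Python. It assumes the index only knows which records contain a lexeme, not where, so a phrase query must verify each candidate; the data, the function names and the parse counter are all invented for illustration and do not mirror PostgreSQL internals.

```python
def parse_positions(text):
    """Crude stand-in for to_tsvector: lexeme -> list of word positions."""
    positions = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        positions.setdefault(word.strip(".,!?"), []).append(pos)
    return positions


# Invented sample data: record id -> body text.
documents = {
    1: "blue cheese sandwich",
    2: "cheese that is blue",
    3: "plain sandwich",
}

# Index without positions: lexeme -> set of record ids.
index = {}
for doc_id, text in documents.items():
    for lexeme in parse_positions(text):
        index.setdefault(lexeme, set()).add(doc_id)

# What a stored tsvector column would hold: parsed once, at write time.
stored_vectors = {doc_id: parse_positions(text) for doc_id, text in documents.items()}

stats = {"parse_calls": 0}  # counts re-parses done at query time


def followed_by(vector, first, second):
    """True if `second` appears directly after `first` (like first <-> second)."""
    return any(p + 1 in vector.get(second, []) for p in vector.get(first, []))


def phrase_search(first, second, use_stored_vector):
    # Step 1: the index yields candidates that contain both lexemes.
    candidates = index.get(first, set()) & index.get(second, set())
    matches = []
    for doc_id in sorted(candidates):
        # Step 2: recheck each candidate for adjacency.
        if use_stored_vector:
            vector = stored_vectors[doc_id]
        else:
            stats["parse_calls"] += 1  # functional index: parse the text again
            vector = parse_positions(documents[doc_id])
        if followed_by(vector, first, second):
            matches.append(doc_id)
    return matches
```

In this model, searching for "blue" followed by "cheese" returns the same record either way, but without the stored vectors every candidate is parsed again during the recheck.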
I confirm that your statements are correct.
Others have already brought up performance-related reasons for choosing a scenario, but there are even more practical reasons.
The fourth scenario, where you just have a functional index calculating the SearchVector, can almost always be used but has some limitations:
- The functional index can only refer to the fields of the model you are searching on. ("An index column need not be just a column of the underlying table …") https://www.postgresql.org/docs/current/indexes-expressional.html
- "PostgreSQL requires functions and operators referenced in an index to be marked as IMMUTABLE. Django doesn’t validate this but PostgreSQL will error. This means that functions such as Concat() aren’t accepted." https://docs.djangoproject.com/en/4.2/ref/models/indexes/#s-expressions
So to answer your question, you are forced to use a SearchVectorField instead of a simple functional index when your full-text search on a model also needs to include fields from other models (e.g. the name of the author of an article), and also when the functions or operators you want to use in your functional index are not immutable (e.g. date functions).
You can see an example in these two slides from my talk "A Pythonic full-text search" I presented at PyCon US 2023 (https://www.paulox.net/2023/04/23/pycon-us-2023/):
- "SearchVector Field" https://speakerdeck.com/pauloxnet/a-pythonic-full-text-search-pycon-us-2022?slide=31
- "SearchVector field update" https://speakerdeck.com/pauloxnet/a-pythonic-full-text-search-pycon-us-2022?slide=32