ElasticSearch updates are not immediate; how do you wait for ElasticSearch to finish updating its index?
Question:
I’m attempting to improve performance on a suite that tests against ElasticSearch.
The tests take a long time because Elasticsearch does not update its indexes immediately after an update. For instance, the following code runs without raising an assertion error.
from elasticsearch import Elasticsearch

elasticsearch = Elasticsearch('es.test')
# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)
# search() returns a response dict; the matching documents are under ['hits']['hits']
results = elasticsearch.search()['hits']['hits']
assert not results
# results are not populated
Currently our hacked-together solution to this issue is dropping a time.sleep call into the code, to give ElasticSearch some time to update its indexes.
from time import sleep
from elasticsearch import Elasticsearch

elasticsearch = Elasticsearch('es.test')
# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)
# Don't want to use sleep functions
sleep(1)
# search() returns a response dict; the matching documents are under ['hits']['hits']
results = elasticsearch.search()['hits']['hits']
assert len(results) == 1
# results are now populated
Obviously this isn’t great, as it’s rather failure-prone: if ElasticSearch takes longer than a second to update its indexes, however unlikely that is, the test will fail. It’s also extremely slow when you’re running hundreds of tests like this.
My attempt to solve the issue has been to query the pending cluster tasks to see if there are any tasks left to be done. However, this doesn’t work, and the following code runs without an assertion error.
from elasticsearch import Elasticsearch

elasticsearch = Elasticsearch('es.test')
# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)
# Query if there are any pending tasks
while elasticsearch.cluster.pending_tasks()['tasks']:
    pass
# search() returns a response dict; the matching documents are under ['hits']['hits']
results = elasticsearch.search()['hits']['hits']
assert not results
# results are not populated
So, back to my original question: ElasticSearch updates are not immediate, so how do you wait for ElasticSearch to finish updating its index?
Answers:
As of version 5.0.0, Elasticsearch has an option:
?refresh=wait_for
on the Index, Update, Delete, and Bulk APIs. This way, the request won’t receive a response until the result is visible to search in ElasticSearch. (Yay!)
See https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information.
Edit: It seems that this functionality is already part of the latest Python elasticsearch API:
https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.index
Change your elasticsearch.update to:
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    refresh='wait_for',
    body={
        ....
    }
)
and you shouldn’t need any sleep or polling.
Seems to work for me:
els.indices.refresh(index)
els.cluster.health(wait_for_no_relocating_shards=True, wait_for_active_shards='all')
You can also call elasticsearch.Refresh('blog') (in the Python client, elasticsearch.indices.refresh(index='blog')) if you don’t want to wait for the index refresh interval.
If you use bulk helpers you can do it like this:
from elasticsearch.helpers import bulk
bulk(client=self.es, actions=data, refresh='wait_for')
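For illustration, here is a minimal sketch of what the data passed to bulk above might look like. The helper name make_actions and the sample documents are hypothetical and not part of the original answer; the _index, _id and _source keys follow the action format that the bulk helpers accept.

```python
def make_actions(docs, index):
    """Yield one bulk action per (doc_id, source) pair.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    for doc_id, source in docs:
        yield {
            "_index": index,    # target index
            "_id": doc_id,      # document id
            "_source": source,  # document body
        }


# Hypothetical sample documents.
docs = [(1, {"title": "first post"}), (2, {"title": "second post"})]
data = list(make_actions(docs, index="blog"))

# With a real client (not created here), the call from the answer would be:
# bulk(client=es, actions=data, refresh='wait_for')
```

Because refresh='wait_for' is passed through to the underlying Bulk request, the call should only return once the indexed documents are visible to search.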
Elasticsearch provides near real-time search. An updated or indexed document is not searchable immediately, but only after the next refresh operation. By default, a refresh is scheduled every 1 second.
To retrieve a document after updating/indexing, you should use the GET API instead. By default, the GET API is realtime and is not affected by the refresh rate of the index. That means that if the update/index succeeded, you should see the modifications in the response to the GET request.
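As a rough sketch of the GET approach, a test could read the document back by id instead of searching for it. The helper name get_source is hypothetical, and es is assumed to be an elasticsearch.Elasticsearch client whose get() accepts index= and id= keywords (older client versions may also require doc_type).

```python
def get_source(es, index, doc_id):
    """Fetch a document by id with the realtime GET API and return its _source.

    Hypothetical helper; `es` is assumed to be an Elasticsearch client.
    """
    # GET is realtime by default, so no refresh is needed before this call.
    response = es.get(index=index, id=doc_id)
    return response["_source"]


# Possible usage in a test (client creation omitted):
# assert get_source(es, "blog", 1)["title"] == "expected title"
```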
If you insist on using the Search API to retrieve a document after updating/indexing, then according to the documentation there are 3 solutions:
- Waiting for the refresh interval
- Setting the ?refresh option in an index/update/delete request
- Using the Refresh API to explicitly complete a refresh (POST _refresh) after an index/update request. However, please note that refreshes are resource-intensive.
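The third option can be sketched as a small test helper that forces a refresh and then searches. The name refresh_and_search is hypothetical, es is assumed to be an elasticsearch.Elasticsearch client, and passing the query through body= is an assumption that matches the client usage shown in the question; newer client versions may prefer a query= keyword.

```python
def refresh_and_search(es, index, query):
    """Force a refresh of `index`, then search it and return the hits.

    Hypothetical helper; `es` is assumed to be an Elasticsearch client.
    """
    # Explicit refresh: makes all operations indexed so far visible to search.
    es.indices.refresh(index=index)
    response = es.search(index=index, body={"query": query})
    # The matching documents live under hits.hits in the search response.
    return response["hits"]["hits"]


# Possible usage in a test (client creation omitted):
# hits = refresh_and_search(es, "blog", {"match_all": {}})
# assert len(hits) == 1
```

Since refreshes are resource-intensive, a helper like this is probably best kept to test suites rather than production code paths.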