ElasticSearch updates are not immediate, how do you wait for ElasticSearch to finish updating its index?

Question:

I’m attempting to improve performance on a suite that tests against ElasticSearch.

The tests take a long time because Elasticsearch does not update its indexes immediately after a write. For instance, the following code runs without raising an assertion error.

from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')

# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)

results = elasticsearch.search()
assert not results
# results are not populated

Currently our hacked-together solution to this issue is dropping a time.sleep call into the code, to give ElasticSearch some time to update its indexes.

from time import sleep
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')

# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)

# Don't want to use sleep functions
sleep(1)

results = elasticsearch.search()
assert len(results) == 1
# results are now populated

Obviously this isn’t great. It’s failure-prone: if ElasticSearch ever takes longer than a second to update its indexes, however unlikely that is, the test will fail. It’s also extremely slow when you’re running hundreds of tests like this.

My attempt to solve the issue has been to query the pending cluster jobs to see if there are any tasks left to be done. However this doesn’t work, and this code will run without an assertion error.

from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')

# Assuming that this is a clean and empty elasticsearch instance
elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    body={
        ....
    }
)

# Query if there are any pending tasks
while elasticsearch.cluster.pending_tasks()['tasks']:
    pass

results = elasticsearch.search()
assert not results
# results are not populated

So basically, back to my original question: ElasticSearch updates are not immediate, so how do you wait for ElasticSearch to finish updating its index?

Asked By: user916367


Answers:

As of version 5.0.0, elasticsearch has an option:

 ?refresh=wait_for

on the Index, Update, Delete, and Bulk APIs. This way, the request won’t receive a response until the result is visible in ElasticSearch. (Yay!)

See https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information.

edit: It seems that this functionality is already part of the latest Python elasticsearch api:
https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.index

Change your elasticsearch.update to:

elasticsearch.update(
    index='blog',
    doc_type='blog',
    id=1,
    refresh='wait_for',
    body={
        ....
    }
)

and you shouldn’t need any sleep or polling.
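The same option applies to the Index API. Below is a minimal sketch, not a definitive implementation: the helper name index_and_search is hypothetical, client is assumed to be an elasticsearch.Elasticsearch instance, and it assumes a client version that accepts the body keyword and does not require doc_type.

```python
def index_and_search(client, index, doc_id, body):
    """Index one document with refresh='wait_for', then search the index.

    `client` is assumed to be an elasticsearch.Elasticsearch instance;
    this helper name is made up for illustration.
    """
    # refresh='wait_for' holds the response until a refresh has made
    # the document visible to search, so no sleep is needed afterwards.
    client.index(index=index, id=doc_id, body=body, refresh="wait_for")

    # The search response keeps the matching documents under hits.hits.
    response = client.search(index=index)
    return response["hits"]["hits"]
```

In a test, something like assert len(index_and_search(elasticsearch, 'blog', 1, {'title': 'hello'})) == 1 should then hold on an otherwise empty index.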

Answered By: TinkerTank

Seems to work for me:

els.indices.refresh(index)
els.cluster.health(wait_for_no_relocating_shards=True, wait_for_active_shards='all')

Answered By: Héctor Sánchez

You can also call elasticsearch.indices.refresh(index='blog') if you don’t want to wait for the index refresh interval.
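An explicit refresh after a write can be wrapped in a small test helper. This is only a sketch under assumptions: update_and_refresh is a hypothetical name, client is assumed to be an elasticsearch.Elasticsearch instance, and the client version is assumed to accept the body keyword without doc_type.

```python
def update_and_refresh(client, index, doc_id, body):
    """Update one document, then force a refresh of its index.

    `client` is assumed to be an elasticsearch.Elasticsearch instance;
    this helper name is made up for illustration.
    """
    response = client.update(index=index, id=doc_id, body=body)

    # An explicit refresh makes the operations performed so far on this
    # index visible to search, without waiting for the refresh interval.
    client.indices.refresh(index=index)
    return response
```

Refreshing after every write is reasonable in a test suite, but it is usually not something you would want in production code, since refreshes cost resources.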

Answered By: sramalingam24

If you use bulk helpers you can do it like this:

from elasticsearch.helpers import bulk
bulk(client=self.es, actions=data, refresh='wait_for')

Answered By: Tobias Ernst

Elasticsearch provides near real-time search. An updated or indexed document is not immediately searchable; it becomes searchable only after the next refresh operation. By default, a refresh is scheduled every second.

To retrieve a document right after updating or indexing it, you should use the GET API instead. By default, the GET API is realtime and is not affected by the refresh rate of the index. That means that if the update or index request succeeded, you should see the modifications in the response to the GET request.
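As a rough illustration of reading a document back by ID rather than searching for it, here is a minimal sketch. The helper name get_source is hypothetical, and client is assumed to be an elasticsearch.Elasticsearch instance from a version that does not require doc_type.

```python
def get_source(client, index, doc_id):
    """Fetch one document by ID and return its stored fields.

    `client` is assumed to be an elasticsearch.Elasticsearch instance;
    this helper name is made up for illustration.
    """
    # GET by ID is realtime by default, so it can see a document that
    # has been written but not yet refreshed into the search index.
    response = client.get(index=index, id=doc_id)
    return response["_source"]
```

A test could then assert on get_source(elasticsearch, 'blog', 1) immediately after the update, with no sleep in between.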

If you insist on using the Search API to retrieve a document after updating or indexing it, the documentation offers three solutions:

  • Waiting for the refresh interval
  • Setting the ?refresh option in an index/update/delete request
  • Using the Refresh API to explicitly complete a refresh (POST _refresh) after an index/update request. However, please note that refreshes are resource-intensive.
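If you go with the first option, waiting does not have to mean a fixed sleep. The sketch below polls the count API until the expected number of documents is visible or a timeout expires. The name wait_for_count and its parameters are hypothetical, and client is assumed to be an elasticsearch.Elasticsearch instance.

```python
import time


def wait_for_count(client, index, expected, timeout=5.0, interval=0.05):
    """Poll until at least `expected` documents are visible in `index`.

    `client` is assumed to be an elasticsearch.Elasticsearch instance;
    the function name and parameters are made up for illustration.
    Raises TimeoutError if the documents do not show up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Like search, count only sees documents that have been refreshed.
        count = client.count(index=index)["count"]
        if count >= expected:
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "only %d of %d documents visible in %r" % (count, expected, index)
            )
        time.sleep(interval)
```

Compared with sleep(1), this returns as soon as the documents are searchable and fails loudly instead of silently racing the refresh.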

Answered By: Đỗ Công Bằng