Elasticsearch Python API: Delete documents by query

Question:

I see that the following API will do delete by query in Elasticsearch – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

But I want to do the same with the Elasticsearch bulk API. I can already use bulk to upload docs with

es.bulk(body=json_batch)

However, I am not sure how to invoke delete by query using the Python bulk API for Elasticsearch.

Asked By: sysuser


Answers:

Seeing as how Elasticsearch has deprecated the delete-by-query API, I created this Python script using the bindings to do the same thing.
First, define an ES connection:

import elasticsearch
es = elasticsearch.Elasticsearch(['localhost'])

Now you can use that to create a query for results you want to delete.

search = es.search(
    q='The Query to ES.',
    index="*logstash-*",
    size=10,
    search_type="scan",
    scroll='5m',
)

Now you can scroll through that query in a loop, generating the bulk request as you go.

while True:
    try:
        # Get the next page of results.
        scroll = es.scroll(scroll_id=search['_scroll_id'], scroll='5m')
    # Since scroll eventually throws an error, catch it and break the loop.
    except elasticsearch.exceptions.NotFoundError:
        break
    # We have results; initialize the bulk variable.
    bulk = ""
    # Build the bulk delete body, one action line per hit.
    for result in scroll['hits']['hits']:
        bulk = bulk + '{ "delete" : { "_index" : "' + str(result['_index']) + '", "_type" : "' + str(result['_type']) + '", "_id" : "' + str(result['_id']) + '" } }\n'
    # Finally do the deleting.
    es.bulk(body=bulk)

To use the bulk API you need to ensure two things:

  1. The document you want to update is identified (index, type, id).
  2. Each request in the body is terminated with a newline (\n); an illustrative snippet follows below.
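Purely as an example (the index, type, and id values here are made up), the body you pass to es.bulk() ends up as newline-delimited JSON, one delete action per line:

bulk_body = (
    '{ "delete" : { "_index" : "logstash-2015.06.01", "_type" : "logs", "_id" : "1" } }\n'
    '{ "delete" : { "_index" : "logstash-2015.06.01", "_type" : "logs", "_id" : "2" } }\n'
)
es.bulk(body=bulk_body)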
Answered By: doug

Thanks, this was really useful!

I have two suggestions:

  1. When getting the next page of results, the scroll_id passed to es.scroll() should be the _scroll_id returned by the previous scroll call, not the one returned by the original search. Elasticsearch does not update the scroll ID on every call, especially with smaller requests (see this discussion), so the code above might work, but it’s not foolproof.

  2. It’s important to clear scrolls, as keeping search contexts open for a long time has a cost (see the Clear Scroll API in the Elasticsearch documentation). They will close eventually after the timeout, but if you’re low on disk space, for example, clearing them can save you a lot of headache.

An easy way is to build a list of scroll IDs on the go (make sure to get rid of duplicates!), and clear everything in the end.

es.clear_scroll(scroll_id=scroll_id_list)
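Putting both suggestions together with doug’s loop, a minimal sketch might look like this (it assumes the same es connection and search response from above; the variable names are illustrative, not from the original answer):

scroll_id_list = []
scroll_id = search['_scroll_id']

while True:
    try:
        # Always pass the scroll ID returned by the most recent call.
        scroll = es.scroll(scroll_id=scroll_id, scroll='5m')
    except elasticsearch.exceptions.NotFoundError:
        break
    scroll_id = scroll['_scroll_id']
    # Collect scroll IDs as we go, skipping duplicates.
    if scroll_id not in scroll_id_list:
        scroll_id_list.append(scroll_id)
    # ... build and send the bulk delete body as in the answer above ...

# Release the search contexts once we are done.
es.clear_scroll(scroll_id=scroll_id_list)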
Answered By: dori

The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete by query you still need to make two requests: one to fetch the records to be deleted, and another to delete them.

The easiest way to do this in bulk is to use the Python module’s scan() helper, which wraps the Elasticsearch Scroll API so you don’t have to keep track of _scroll_ids. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():

from elasticsearch.helpers import bulk, scan

bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):

    # Mark each hit for deletion and collect it for one bulk call.
    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

bulk(es, bulk_deletes)

Since _source=False is passed, the document body is not returned, so each result is pretty small. However, if you do have memory constraints, you can batch this pretty easily:

BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):

    if i == BATCH_SIZE:
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

    i += 1

# Flush any remaining deletes that didn't fill a complete batch.
bulk(es, bulk_deletes)
Answered By: drs

I’m currently using this script based on @drs’s response, but using the bulk() helper consistently. It can create batches of jobs from an iterator via the chunk_size parameter (defaults to 500; see streaming_bulk() for more info).

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

BULK_SIZE = 1000

def stream_items(es, query):
    for e in scan(es, 
                  query=query, 
                  index=ES_INDEX,
                  doc_type=ES_DOCTYPE, 
                  scroll='1m',
                  _source=False):

        # There is a parameter to avoid this del statement (`track_source`), but it doesn't exist in my version.
        del e['_score']
        e['_op_type'] = 'delete'
        yield e

es = Elasticsearch(host='localhost')
bulk(es, stream_items(es, query), chunk_size=BULK_SIZE)
Answered By: wiredrat

While operationally equivalent to many other answers, I personally find the following syntax more accessible:

import elasticsearch
from elasticsearch.helpers import bulk

es = elasticsearch.Elasticsearch(['localhost'])

ids = [1,2,3, ...]      # list of ids that will be deleted
index = "foo_index"     # index where the documents are indexed

actions = ({
    "_id": _id,
    "_op_type": "delete"
} for _id in ids)

bulk(client=es, actions=actions, index=index, refresh=True)
# `refresh=True` makes the result immediately available
Answered By: ciurlaro