Elasticsearch python API: Delete documents by query
Question:
I see that the following API will do delete by query in Elasticsearch – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
But I want to do the same with the Elasticsearch bulk API. Even though I can use bulk to upload docs with
es.bulk(body=json_batch)
I am not sure how to invoke delete-by-query through the Python bulk API for Elasticsearch.
Answers:
Seeing as how Elasticsearch has deprecated the delete-by-query API, I created this Python script using the bindings to do the same thing.
First thing define an ES connection:
import elasticsearch
es = elasticsearch.Elasticsearch(['localhost'])
Now you can use that to create a query for results you want to delete.
search = es.search(
    q='The Query to ES.',
    index="*logstash-*",
    size=10,
    search_type="scan",
    scroll='5m',
)
Now you can scroll through that query in a loop, generating our delete request as we go.
while True:
    try:
        # Get the next page of results.
        scroll = es.scroll(scroll_id=search['_scroll_id'], scroll='5m')
    # Since scroll throws an error when it is done, catch it and break the loop.
    except elasticsearch.exceptions.NotFoundError:
        break
    # We have results; initialize the bulk variable.
    bulk = ""
    for result in scroll['hits']['hits']:
        bulk = bulk + '{ "delete" : { "_index" : "' + str(result['_index']) + '", "_type" : "' + str(result['_type']) + '", "_id" : "' + str(result['_id']) + '" } }\n'
    # Finally, do the deleting.
    es.bulk(body=bulk)
To use the bulk API you need to ensure two things:
- Each document you want to delete is fully identified (index, type, id).
- Each action line is terminated with a newline (\n).
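As a self-contained illustration of those two rules (the index, type, and IDs below are made-up placeholders), here is a sketch that builds a newline-terminated bulk body with json.dumps instead of string concatenation:

```python
import json

# Hypothetical hits; in practice these come from a search/scroll response.
hits = [
    {"_index": "logstash-2015.01.01", "_type": "logs", "_id": "doc-1"},
    {"_index": "logstash-2015.01.01", "_type": "logs", "_id": "doc-2"},
]

# One fully identified delete action per line, each terminated with "\n".
bulk_body = ""
for hit in hits:
    action = {"delete": {"_index": hit["_index"],
                         "_type": hit["_type"],
                         "_id": hit["_id"]}}
    bulk_body += json.dumps(action) + "\n"

# es.bulk(body=bulk_body)  # sending it requires a live cluster
```

Using json.dumps also avoids broken JSON when an ID happens to contain a quote character.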
Thanks, this was really useful!
I have two suggestions:
- When getting the next page of results with scroll, the scroll_id passed to es.scroll(scroll_id=search['_scroll_id']) should be the _scroll_id returned by the last scroll, not the one the search returned. Elasticsearch does not update the scroll ID every time, especially with smaller requests (see this discussion), so this code might work, but it’s not foolproof.
- It’s important to clear scrolls, as keeping search contexts open for a long time has a cost (see the Clear Scroll API in the Elasticsearch documentation). They will close eventually after the timeout, but if you’re low on disk space, for example, clearing them early can save you a lot of headache.
An easy way is to build a list of scroll IDs on the go (make sure to get rid of duplicates!), and clear everything in the end.
es.clear_scroll(scroll_id=scroll_id_list)
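A minimal sketch of that bookkeeping, assuming you append each page's _scroll_id to a list while paging (the IDs below are made-up placeholders; the es.clear_scroll call is commented out since it needs a live cluster):

```python
def dedupe_scroll_ids(scroll_ids):
    """Drop duplicate scroll IDs while preserving the order they were seen in."""
    seen = set()
    unique = []
    for sid in scroll_ids:
        if sid not in seen:
            seen.add(sid)
            unique.append(sid)
    return unique

# Elasticsearch often reuses the same scroll ID across pages,
# so the raw list can contain many duplicates.
scroll_id_list = dedupe_scroll_ids(["scan-abc", "scan-abc", "scan-def"])

# es.clear_scroll(scroll_id=scroll_id_list)  # free the search contexts early
```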
The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete-by-query you still need to make two queries: one to fetch the records to be deleted, and another to delete them.
The easiest way to do this in bulk is to use the Python module’s scan() helper, which wraps the Elasticsearch Scroll API so you don’t have to keep track of _scroll_ids. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():
from elasticsearch.helpers import bulk, scan

bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

bulk(es, bulk_deletes)
Since _source=False is passed, the document body is not returned, so each result is pretty small. However, if you do have memory constraints, you can batch this pretty easily:
BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):
    if i == BATCH_SIZE:
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)
    i += 1

bulk(es, bulk_deletes)
I’m currently using this script based on @drs’s response, but using the bulk() helper consistently. It can create batches of jobs from an iterator by using the chunk_size parameter (defaults to 500, see streaming_bulk() for more info).
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

BULK_SIZE = 1000

def stream_items(es, query):
    for e in scan(es,
                  query=query,
                  index=ES_INDEX,
                  doc_type=ES_DOCTYPE,
                  scroll='1m',
                  _source=False):
        # There is a parameter (`track_scores`) that avoids this del statement,
        # but it doesn't exist in my version.
        del e['_score']
        e['_op_type'] = 'delete'
        yield e

es = Elasticsearch(host='localhost')
bulk(es, stream_items(es, query), chunk_size=BULK_SIZE)
While operationally equivalent to many other answers, I personally find the following syntax more accessible:
import elasticsearch
from elasticsearch.helpers import bulk
es = elasticsearch.Elasticsearch(['localhost'])
ids = [1,2,3, ...] # list of ids that will be deleted
index = "foo_index" # index where the documents are indexed
actions = ({
    "_id": _id,
    "_op_type": "delete"
} for _id in ids)
bulk(client=es, actions=actions, index=index, refresh=True)
# `refresh=True` makes the result immediately available