
Elasticsearch python API: Delete documents by query


Seeing as how Elasticsearch has deprecated the delete-by-query API, I created this Python script using the bindings to do the same thing. First, define an ES connection:

import elasticsearch

es = elasticsearch.Elasticsearch(['localhost'])

Now you can use that to create a query for results you want to delete.

search = es.search(
    q='The Query to ES.',
    index="*logstash-*",
    size=10,
    search_type="scan",
    scroll='5m',
)

Now you can scroll through that query in a loop, building the bulk request as you go.

while True:
    try:
        # Get the next page of results.
        scroll = es.scroll(scroll_id=search['_scroll_id'], scroll='5m')
    # Scroll throws an error once the results run out, so catch it and break the loop.
    except elasticsearch.exceptions.NotFoundError:
        break

    # We have results; initialize the bulk variable.
    bulk = ""
    # Build a bulk delete action for every hit on this page.
    for result in scroll['hits']['hits']:
        bulk = bulk + '{ "delete" : { "_index" : "' + str(result['_index']) + '", "_type" : "' + str(result['_type']) + '", "_id" : "' + str(result['_id']) + '" } }\n'

    # Finally do the deleting.
    es.bulk(body=bulk)

To use the bulk API you need to ensure two things:

  1. The document you want to update is identified (index, type, id).
  2. Each request is terminated with a newline (\n), as shown in the example below.
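For reference, here is roughly what the assembled bulk body from the loop above ends up looking like; the index, type, and id values are made up, and note that the last action is followed by a newline as well:

{ "delete" : { "_index" : "logstash-2015.11.03", "_type" : "logs", "_id" : "AVDeLrMtb2NB4IxRRyXj" } }
{ "delete" : { "_index" : "logstash-2015.11.03", "_type" : "logs", "_id" : "AVDeLrMtb2NB4IxRRyXk" } }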


The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete-by-query you still need to make two queries: one to fetch the records to be deleted, and another to delete them.
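For illustration, a single delete action handed to the bulk() helper might look like the record below; this is a minimal sketch, and the index, type, and id values are hypothetical:

action = {
    '_op_type': 'delete',              # tell the bulk helper to delete rather than index
    '_index': 'logstash-2015.11.03',   # hypothetical index name
    '_type': 'logs',                   # hypothetical document type
    '_id': 'AVDeLrMtb2NB4IxRRyXj',     # hypothetical document id
}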

The easiest way to do this in bulk is to use the elasticsearch-py module's scan() helper, which wraps the Elasticsearch Scroll API so you don't have to keep track of _scroll_ids. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():

from elasticsearch.helpers import bulk, scan

bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

bulk(es, bulk_deletes)
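If you want to confirm how many documents were removed, the bulk() helper returns a tuple of the number of successful actions and a list of errors. A quick check (not part of the original answer, and assuming raise_on_error is turned off so failures are returned rather than raised) might look like this:

deleted, errors = bulk(es, bulk_deletes, raise_on_error=False)
print('Deleted %d documents' % deleted)
if errors:
    print('Some deletes failed: %s' % errors)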

Since _source=False is passed, the document body is not returned, so each result is pretty small. However, if you do have memory constraints, you can batch this pretty easily:

BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):

    if i == BATCH_SIZE:
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)
    i += 1

bulk(es, bulk_deletes)


I'm currently using this script based on @drs's response, but using the bulk() helper consistently. It has the ability to create batches of jobs from an iterator by using the chunk_size parameter (defaults to 500, see streaming_bulk() for more info).

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

BULK_SIZE = 1000

def stream_items(es, query):
    for e in scan(es,
                  query=query,
                  index=ES_INDEX,
                  doc_type=ES_DOCTYPE,
                  scroll='1m',
                  _source=False):
        # There is a parameter to avoid this del statement (`track_scores`) but in my version it doesn't exist.
        del e['_score']
        e['_op_type'] = 'delete'
        yield e

es = Elasticsearch(host='localhost')
bulk(es, stream_items(es, query), chunk_size=BULK_SIZE)