
Elasticsearch python API: Delete documents by query


Seeing as how Elasticsearch has deprecated the delete-by-query API, I created this Python script using the bindings to do the same thing. First, define an ES connection:

import elasticsearch

es = elasticsearch.Elasticsearch(['localhost'])

Now you can use that to create a query for results you want to delete.

search = es.search(
    q='The Query to ES.',
    index="*logstash-*",
    size=10,
    search_type="scan",
    scroll='5m',
)

Now you can scroll through that query in a loop, building the bulk request as you go.

while True:
    try:
        # Get the next page of results.
        scroll = es.scroll(scroll_id=search['_scroll_id'], scroll='5m')
    # Scroll throws an error once the results run out, so catch it and break the loop.
    except elasticsearch.exceptions.NotFoundError:
        break

    # We have results; initialize the bulk variable.
    bulk = ""
    # Build a bulk delete action for every hit on this page.
    for result in scroll['hits']['hits']:
        bulk = bulk + '{ "delete" : { "_index" : "' + str(result['_index']) + '", "_type" : "' + str(result['_type']) + '", "_id" : "' + str(result['_id']) + '" } }\n'

    # Finally do the deleting.
    es.bulk(body=bulk)

To use the bulk API you need to ensure two things:

  1. The document you want to update is identified (index, type, id).
  2. Each request is terminated with a newline (\n), as shown in the example below.
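For reference, here is roughly what the assembled bulk body from the loop above ends up looking like; the index, type, and id values are made up, and note that the last action is followed by a newline as well:

{ "delete" : { "_index" : "logstash-2015.11.03", "_type" : "logs", "_id" : "AVDeLrMtb2NB4IxRRyXj" } }
{ "delete" : { "_index" : "logstash-2015.11.03", "_type" : "logs", "_id" : "AVDeLrMtb2NB4IxRRyXk" } }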


The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete-by-query you still need to make two queries: one to fetch the records to be deleted, and another to delete them.
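For illustration, a single delete action handed to the bulk() helper might look like the record below; this is a minimal sketch, and the index, type, and id values are hypothetical:

action = {
    '_op_type': 'delete',              # tell the bulk helper to delete rather than index
    '_index': 'logstash-2015.11.03',   # hypothetical index name
    '_type': 'logs',                   # hypothetical document type
    '_id': 'AVDeLrMtb2NB4IxRRyXj',     # hypothetical document id
}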

The easiest way to do this in bulk is to use the elasticsearch-py module's scan() helper, which wraps the Elasticsearch Scroll API so you don't have to keep track of _scroll_ids. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():

from elasticsearch.helpers import bulk, scan

bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

bulk(es, bulk_deletes)
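If you want to confirm how many documents were removed, the bulk() helper returns a tuple of the number of successful actions and a list of errors. A quick check (not part of the original answer, and assuming raise_on_error is turned off so failures are returned rather than raised) might look like this:

deleted, errors = bulk(es, bulk_deletes, raise_on_error=False)
print('Deleted %d documents' % deleted)
if errors:
    print('Some deletes failed: %s' % errors)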

Since _source=False is passed, the document body is not returned, so each result is pretty small. However, if you do have memory constraints, you can batch this pretty easily:

BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):

    if i == BATCH_SIZE:
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)
    i += 1

bulk(es, bulk_deletes)


I'm currently using this script based on @drs's response, but using the bulk() helper consistently. It has the ability to create batches of jobs from an iterator by using the chunk_size parameter (defaults to 500, see streaming_bulk() for more info).

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

BULK_SIZE = 1000

def stream_items(es, query):
    for e in scan(es,
                  query=query,
                  index=ES_INDEX,
                  doc_type=ES_DOCTYPE,
                  scroll='1m',
                  _source=False):
        # There is a parameter to avoid this del statement (`track_scores`) but in my version it doesn't exist.
        del e['_score']
        e['_op_type'] = 'delete'
        yield e

es = Elasticsearch(host='localhost')
bulk(es, stream_items(es, query), chunk_size=BULK_SIZE)