
elasticsearch scrolling using python client


Using the Python requests library:

    import json

    import requests

    elastic_url = 'http://localhost:9200/my_index/_search?scroll=1m'
    scroll_api_url = 'http://localhost:9200/_search/scroll'
    headers = {'Content-Type': 'application/json'}

    payload = {
        "size": 100,
        "sort": ["_doc"],
        "query": {
            "match": {
                "title": "elasticsearch"
            }
        }
    }

    r1 = requests.request(
        "POST",
        elastic_url,
        data=json.dumps(payload),
        headers=headers
    )

    # first batch of data
    try:
        res_json = r1.json()
        data = res_json['hits']['hits']
        _scroll_id = res_json['_scroll_id']
    except KeyError:
        data = []
        _scroll_id = None
        print('Error: Elastic Search: %s' % str(r1.json()))

    while data:
        print(data)
        # scroll to get the next batch, sending the most recent scroll_id
        scroll_payload = json.dumps({
            'scroll': '1m',
            'scroll_id': _scroll_id
        })
        scroll_res = requests.request(
            "POST", scroll_api_url,
            data=scroll_payload,
            headers=headers
        )
        try:
            res_json = scroll_res.json()
            data = res_json['hits']['hits']
            # keep the scroll_id from this response for the next request
            _scroll_id = res_json['_scroll_id']
        except KeyError:
            data = []
            _scroll_id = None
            err_msg = 'Error: Elastic Search Scroll: %s'
            print(err_msg % str(scroll_res.json()))

Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#search-request-scroll


This is an old question, but for some reason it came up first when I searched for "elasticsearch python scroll". The Python module provides a helper method that does all the work for you. It is a generator function that yields each document while managing the underlying scroll ids.

https://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan

Here is an example of usage:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    query = {
        "query": {"match_all": {}}
    }

    es = Elasticsearch(...)

    for hit in scan(es, index="my-index", query=query):
        print(hit["_source"]["field"])


In fact, the code has a bug in it: to use the scroll feature correctly, you are supposed to pass the new scroll_id returned by each call to the next call to scroll(), not reuse the first one:

Important

The initial search request and each subsequent scroll request returns a new scroll_id — only the most recent scroll_id should be used.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

It's working because Elasticsearch does not always change the scroll_id between calls and, for smaller result sets, can keep returning the same scroll_id that was originally returned for some time. This discussion from last year is between two other users seeing the same thing, with the same scroll_id being returned for a while:

http://elasticsearch-users.115913.n3.nabble.com/Distributing-query-results-using-scrolling-td4036726.html

So while your code works for a smaller result set, it is not correct: you need to capture the scroll_id returned by each new call to scroll() and use it for the next call.
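To make that concrete, here is a minimal sketch of a loop that always sends the most recent scroll_id. The function name scroll_hits and its parameters are my own for illustration, not part of the library. It assumes a client object whose search() accepts index, body, scroll and size keyword arguments and whose scroll() accepts scroll_id and scroll, as the elasticsearch-py 7.x client does; newer client versions may want the query passed differently.

```python
def scroll_hits(client, index, body, scroll="1m", size=100):
    """Yield every hit of a search, following the scroll one page at a time.

    `client` is assumed to expose search() and scroll() with the keyword
    arguments used below (hypothetical wrapper, not a library function).
    """
    # The initial search opens the scroll and returns the first page.
    page = client.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = page.get("_scroll_id")
    hits = page["hits"]["hits"]

    while hits:
        for hit in hits:
            yield hit

        # Send the scroll_id from the latest response, never the first one.
        page = client.scroll(scroll_id=scroll_id, scroll=scroll)
        # Capture the id returned by this call for use in the next call.
        scroll_id = page.get("_scroll_id", scroll_id)
        hits = page["hits"]["hits"]
```

You would call it as `for hit in scroll_hits(es, "my-index", {"query": {"match_all": {}}}): ...`. In practice the scan helper mentioned in the other answer is probably the better choice; this is only meant to show where the scroll_id has to be updated.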