Elasticsearch pagination

python elasticsearch

The default way to paginate search results in Elasticsearch is the from/size parameters. This works only for the top 10k results, however, because from + size may not exceed index.max_result_window (10000 by default).

If you need to go beyond that, use search_after.

If you need to dump an entire index that contains more than 10k documents, use the scroll API.
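As a minimal sketch of how a page number maps onto from/size, the helper below builds a search body and refuses requests that would fall outside the result window. The names page_body and MAX_RESULT_WINDOW are made up for this illustration, and the returned dict is only the request body, not a call to any client.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window


def page_body(query, page, size=10, window=MAX_RESULT_WINDOW):
    """Build a from/size search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    # Elasticsearch rejects requests where from + size exceeds the window.
    if offset + size > window:
        raise ValueError(
            f"from + size = {offset + size} exceeds the result window of {window}"
        )
    return {"query": query, "from": offset, "size": size}
```

For example, page 3 with 20 hits per page yields from=40 and size=20, while page 1001 with 10 hits per page is rejected.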

What's the difference?

All three let you retrieve search results in portions, but they differ in important ways.

from/size is the cheapest and fastest; it is what Google would use to serve the second, third, and later result pages if it ran on Elasticsearch.

The scroll API is expensive because it keeps a kind of snapshot of the index as of the first request, so that by the end of the scroll you have seen exactly the data that was present at the start. Every open scroll costs resources, and running many of them in parallel can kill your performance, so proceed with caution.

search_after sits halfway between the two. As the Elasticsearch documentation puts it:

search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.

So it lets you paginate beyond 10k results, at the cost of possible inconsistency between pages if the index changes while you walk through it.
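The search_after walk can be sketched as follows. This is an illustration under assumptions, not the client's own API: iter_search_after is a hypothetical helper, and search stands for any callable that takes a request body and returns an Elasticsearch-style response dict. The sort should end in a field that is unique per document, otherwise hits that tie on the sort values can be skipped.

```python
def iter_search_after(search, query, sort, size=1000):
    """Yield every hit, page by page, using search_after (hypothetical helper).

    search: callable taking a request body and returning a response dict
            shaped like {"hits": {"hits": [...]}}.
    sort:   sort specification; should end in a unique tiebreaker field.
    """
    body = {"query": query, "sort": sort, "size": size}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return  # an empty page means the walk is finished
        yield from hits
        # The next page starts right after the sort values of the last hit.
        body = {**body, "search_after": hits[-1]["sort"]}
```

With a real client, search could be a small wrapper around the client's search call for one index. No state is kept on the server between pages, which is exactly why concurrent updates can shift the results.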

Why the 10k limit?

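index.max_result_window defaults to 10000 as a safeguard against out-of-memory situations. It is a configurable index setting rather than a truly hard limit, but raising it has a cost, as the documentation explains: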

index.max_result_window
The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory.
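To make the arithmetic concrete, here is a small sketch that computes the deepest page from/size can reach for a given page size. The function name last_reachable_page is invented for this example; the only fact it relies on is the from + size <= index.max_result_window rule quoted above.

```python
def last_reachable_page(size, window=10_000):
    """Return the highest 1-based page number that from/size can serve.

    Page p needs from + size = p * size, which must not exceed the window.
    """
    if size < 1:
        raise ValueError("size must be positive")
    return window // size
```

With the default window, 10 hits per page gives 1000 reachable pages and 30 hits per page gives 333.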

What about sliced scroll?

Sliced scroll is just a faster way of doing a normal scroll: it lets you download the documents in parallel. A slice is simply a subset of the documents that the scroll query would return, and each slice can be scrolled independently.
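A rough sketch of how the per-slice request bodies could be built is shown below. The helper slice_bodies is hypothetical; the slice object with id and max is the part that comes from the scroll API. Each body would be sent as the first request of its own scroll, typically from a separate worker.

```python
def slice_bodies(query, slices, size=1000):
    """Build one initial scroll request body per slice (hypothetical helper)."""
    # Elasticsearch expects max to be greater than 1; with a single slice
    # a plain scroll does the same job.
    if slices < 2:
        raise ValueError("use at least 2 slices")
    return [
        {"slice": {"id": i, "max": slices}, "query": query, "size": size}
        for i in range(slices)
    ]
```

Together the slices cover the same documents as one unsliced scroll, so the workers' results can simply be concatenated.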

For reference, a plain (unsliced) scroll loop with the Ruby client looks roughly like this:

    # ElkConfigClient is assumed to be an already configured Elasticsearch client.
    response_array = []

    # The first request opens the scroll context and returns the first batch.
    response = ElkConfigClient.search(
      index: "index_name",
      body: {
        query: {
          bool: {
            must: [
              "search_query" # placeholder for the actual query clause
            ]
          }
        }
      },
      scroll: '1h',
      size: 1000
    )
    s_id = response["_scroll_id"]
    response_array.concat(response["hits"]["hits"])

    # Keep fetching batches until one comes back empty.
    loop do
      next_response = ElkConfigClient.scroll(scroll_id: s_id, scroll: '1h')
      s_id = next_response["_scroll_id"]
      hits = next_response["hits"]["hits"]
      break if hits.empty?

      response_array.concat(hits)
    end
