Bad search results due to sharding? Bad search results due to sharding? elasticsearch elasticsearch

Bad search results due to sharding?


The problem is that normal queries on Elasticsearch that span multiple shards use what's known as Query then Fetch:

Default search type: Query Then Fetch

By default, Elasticsearch will use a search type called “Query Then Fetch“. The way it works is as follows:

  • Send the query to each shard

  • Find all matching documents and calculate scores using local Term/Document Frequencies

  • Build a priority queue of results (sort, pagination with from/to, etc)

  • Return metadata about the results to requesting node. Note, the actual document is not sent yet, just the scores

  • Scores from all the shards are merged and sorted on the requesting node, docs are selected according to query criteria

  • Finally, the actual docs are retrieved from individual shards where they reside.

  • Results are returned to the client

This system usually works fine. In most cases, your index has “enough” documents to smooth out the Term/Document frequency statistics. So while each shard may not have complete knowledge of frequencies across the cluster, results are “good enough” because the frequencies are fairly similar everywhere.

http://www.elasticsearch.org/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch/

The problem for you is that it calculates the TF-IDF score locally -

What you will want to try is using DFS Query then Fetch, which will pre-query all the shards and calculate the scores using a global, not local, term document/frequency:

  • Prequery each shard asking about Term and Document frequencies

  • Send the query to each shard

  • Find all matching documents and calculate scores using global Term/Document Frequencies calculated from the prequery.

  • Build a priority queue of results (sort, pagination with from/to, etc)

  • Return metadata about the results to requesting node. Note, the actual document is not sent yet, just the scores

  • Scores from all the shards are merged and sorted on the requesting node, docs are selected according to query criteria

  • Finally, the actual docs are retrieved from individual shards where they reside.

  • Results are returned to the client

In your case I would use DFS Query then Fetch, but I'd also check out the various alternatives - Elasticsearch has a lot of flexibility in modifying the search request type:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html