Bad search results due to sharding?

The problem is that normal queries on Elasticsearch that span multiple shards use what's known as Query then Fetch:

Default search type: Query Then Fetch
By default, Elasticsearch will use a search type called “Query Then Fetch“. The way it works is as follows:
Send the query to each shard
Find all matching documents and calculate scores using local Term/Document Frequencies
Build a priority queue of results (sort, pagination with from/to, etc)
Return metadata about the results to requesting node. Note, the actual document is not sent yet, just the scores
Scores from all the shards are merged and sorted on the requesting node, docs are selected according to query criteria
Finally, the actual docs are retrieved from individual shards where they reside.
Results are returned to the client
This system usually works fine. In most cases, your index has “enough” documents to smooth out the Term/Document frequency statistics. So while each shard may not have complete knowledge of frequencies across the cluster, results are “good enough” because the frequencies are fairly similar everywhere.

http://www.elasticsearch.org/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch/

The problem for you is that it calculates the TF-IDF score locally -

What you will want to try is using DFS Query then Fetch, which will pre-query all the shards and calculate the scores using a global, not local, term document/frequency:

Prequery each shard asking about Term and Document frequencies
Send the query to each shard
Find all matching documents and calculate scores using global Term/Document Frequencies calculated from the prequery.
Build a priority queue of results (sort, pagination with from/to, etc)
Return metadata about the results to requesting node. Note, the actual document is not sent yet, just the scores
Scores from all the shards are merged and sorted on the requesting node, docs are selected according to query criteria
Finally, the actual docs are retrieved from individual shards where they reside.
Results are returned to the client

In your case I would use DFS Query then Fetch, but I'd also check out the various alternatives - Elasticsearch has a lot of flexibility in modifying the search request type:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html

CodeHunter

Bad search results due to sharding?

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last