Duplicate documents in Elasticsearch index with the same _uid


We found the answer! The problem was that the index had unexpectedly switched the hashing algorithm it used for routing, and this caused some updated documents to be stored on different shards to their original versions.

A GET request to /index-name/_settings revealed this:

"version": {
  "created": "1070599",
  "upgraded": "2040699"
},
"legacy": {
  "routing": {
    "use_type": "false",
    "hash": {
      "type": "org.elasticsearch.cluster.routing.DjbHashFunction"
    }
  }
}

"1070599" is the internal version ID for Elasticsearch 1.7 (specifically 1.7.5), and "2040699" is the ID for ES 2.4 (specifically 2.4.6).
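For anyone reading these IDs in their own settings, here is a minimal sketch of how they can be decoded. It assumes the ID packs major, minor, revision and build as two decimal digits each, which matches the values above; the helper name `decode_es_version` is made up for illustration.

```python
def decode_es_version(version_id: str) -> str:
    """Turn an internal version ID such as "2040699" into "2.4.6".

    Assumes the layout MMmmrrbb: major, minor, revision and build as
    two decimal digits each (build 99 appears on release builds).
    """
    n = int(version_id)
    major = n // 1_000_000 % 100
    minor = n // 10_000 % 100
    revision = n // 100 % 100
    return f"{major}.{minor}.{revision}"


for raw in ("1070599", "2040699"):
    print(raw, "->", decode_es_version(raw))
```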

It looks like the index tried to upgrade itself from 1.7 to 2.4, despite the fact that it was already running 2.4. This is the issue described here: https://github.com/elastic/elasticsearch/issues/18459#issuecomment-220313383

We think this is what happened to trigger the change:

  1. Back when we upgraded the index from ES 1.7 to 2.4, we decided not to upgrade Elasticsearch in-place, since that would cause downtime. Instead, we created a separate ES 2.4 cluster.

    We loaded data into the new cluster using a tool that copied over all the index settings along with the data. That included the version setting, which you should not set yourself in ES 2.4.

  2. While dealing with a recent issue, we happened to close and reopen the index. This normally preserves all the data, but because of the incorrect version setting, it caused Elasticsearch to think that an upgrade was in progress.

  3. ES automatically set the legacy.routing.hash.type setting because of the false upgrade. This meant that any data indexed after this point was routed with the old DjbHashFunction instead of the default Murmur3HashFunction, which had been used to route the data originally.
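To make the effect of step 3 concrete, here is a rough sketch of how the same document ID can map to different shards under the two hash functions. It is an illustration under stated assumptions, not a verified port of the Elasticsearch code: I am assuming the DJB variant is the classic "times 33, start at 5381" hash, that the Murmur3 variant is MurmurHash3 x86 32-bit over the ID's UTF-16 little-endian bytes with seed 0, and that the shard is the hash modulo the shard count. The shard count and the IDs are made up.

```python
def to_int32(x: int) -> int:
    """Reinterpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def djb_hash(routing: str) -> int:
    """Classic DJB2 hash: h = h * 33 + char, starting from 5381."""
    h = 5381
    for ch in routing:  # ord() matches a UTF-16 code unit for BMP characters only
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return to_int32(h)


def rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as a signed 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    body_end = n & ~3
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = rotl32((k * c1) & 0xFFFFFFFF, 15)
        h ^= (k * c2) & 0xFFFFFFFF
        h = (rotl32(h, 13) * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = rotl32((k * c1) & 0xFFFFFFFF, 15)
        h ^= (k * c2) & 0xFFFFFFFF
    # Final avalanche mix
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return to_int32(h)


def murmur3_hash(routing: str) -> int:
    # Assumption: the ID is hashed as UTF-16 little-endian bytes with seed 0.
    return murmur3_32(routing.encode("utf-16-le"))


def shard_for(hash_value: int, num_shards: int) -> int:
    # Python's % is already non-negative for a positive modulus.
    return hash_value % num_shards


NUM_SHARDS = 5  # hypothetical shard count
ids = [f"doc-{i}" for i in range(20)]  # hypothetical document IDs
moved = [
    d for d in ids
    if shard_for(djb_hash(d), NUM_SHARDS) != shard_for(murmur3_hash(d), NUM_SHARDS)
]
print(f"{len(moved)} of {len(ids)} ids route to a different shard")
```

Any ID in `moved` is one where an update sent after the switch would be written to a shard that does not hold the original copy, which is how two documents with the same _uid can coexist in one index.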

This means that reindexing the data into a new index was the right thing to do to fix the issue. The new index has the correct version setting and no legacy hash function settings:

"version": {
  "created": "2040699"
}
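If you want to check other indices for the same problem, something along these lines might help. It is only a sketch: `routing_warnings` is a hypothetical helper, it expects the "index" portion of the /index-name/_settings response already parsed into a dict, and fetching the settings is left out. Note that a genuinely upgraded index will also carry an "upgraded" version, so the first warning is a prompt to look closer rather than proof of a fault.

```python
def routing_warnings(index_settings: dict) -> list:
    """Return warnings for settings that hint at the routing problem."""
    warnings = []

    version = index_settings.get("version", {})
    created = version.get("created")
    upgraded = version.get("upgraded")
    if upgraded is not None and upgraded != created:
        warnings.append(
            f"index records an upgrade from version {created} to {upgraded}"
        )

    hash_type = (
        index_settings.get("legacy", {})
        .get("routing", {})
        .get("hash", {})
        .get("type")
    )
    if hash_type:
        warnings.append(f"legacy routing hash in use: {hash_type}")

    return warnings


# The two settings fragments from this answer, as Python dicts.
broken = {
    "version": {"created": "1070599", "upgraded": "2040699"},
    "legacy": {
        "routing": {
            "use_type": "false",
            "hash": {"type": "org.elasticsearch.cluster.routing.DjbHashFunction"},
        }
    },
}
fixed = {"version": {"created": "2040699"}}

for name, settings in (("old index", broken), ("new index", fixed)):
    print(name, routing_warnings(settings))
```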