
ElasticSearch returning only documents with distinct value


You can eliminate duplicates using aggregations. With a terms aggregation the results are grouped by one field, e.g. name; it also provides a count of the occurrences of each value of the field, and sorts the buckets by this count (descending).

{
  "query": {
    "fuzzy_like_this_field": {
      "favorite_cars": {
        "like_text": "toyota",
        "max_query_terms": 12
      }
    }
  },
  "aggs": {
    "grouped_by_name": {
      "terms": {
        "field": "name",
        "size": 0
      }
    }
  }
}

In addition to the hits, the result will also contain the buckets with the unique values in key and with the count in doc_count:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
      "_score" : 0.19178301,
      "_source" : {"name":"ABC","favorite_cars":["ferrari","toyota"]}
    }, {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "IdEbAcI6TM6oCVxCI_3fug",
      "_score" : 0.19178301,
      "_source" : {"name":"ABC","favorite_cars":["ferrari","toyota"]}
    } ]
  },
  "aggregations" : {
    "grouped_by_name" : {
      "buckets" : [ {
        "key" : "abc",
        "doc_count" : 2
      } ]
    }
  }
}

Note that using aggregations is costly because of the duplicate elimination and result sorting involved.
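Conceptually, the terms aggregation groups field values, counts them, and sorts the buckets by count descending. A rough plain-Java sketch of that behavior, using hypothetical data rather than the Elasticsearch client API:

```java
import java.util.*;
import java.util.stream.*;

public class TermsAggSketch {
    public static void main(String[] args) {
        // Hypothetical "name" values from four indexed documents.
        List<String> names = List.of("ABC", "abc", "XYZ", "ABC");

        // Analyzed fields are lowercased before aggregation; then group
        // by value and count occurrences, as the terms aggregation does.
        Map<String, Long> buckets = names.stream()
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(n -> n, Collectors.counting()));

        // Buckets come back ordered by doc_count, descending.
        buckets.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

This also shows why the bucket key in the response above is "abc" even though the documents contain "ABC": the analyzed field is lowercased before bucketing.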


ElasticSearch doesn't provide any query by which you can get distinct documents based on a field value.

Ideally you should have indexed the same document with the same type and id, since these two things are used by ElasticSearch to give a document its unique _uid. A unique id is important not only for detecting duplicate documents but also for updating the same document on modification instead of inserting a new one. For more information about indexing documents you can read this.

But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value yourself. In fact, this gives you more flexibility to perform custom operations on the responses you get from ES.

SearchResponse response = client.prepareSearch().execute().actionGet();
SearchHits hits = response.getHits();
Iterator<SearchHit> iterator = hits.iterator();
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
while (iterator.hasNext()) {
    SearchHit searchHit = iterator.next();
    Map<String, Object> source = searchHit.getSource();
    // Keyed by the "name" field: later hits with the same name overwrite
    // earlier ones, so only one hit per distinct name remains.
    if (source.get("name") != null) {
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}

So you will have a map of unique SearchHit objects, keyed by the name field.
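The same map-based deduplication can be sketched standalone, with plain maps standing in for the SearchHit sources (the data here is hypothetical):

```java
import java.util.*;

public class DedupByField {
    public static void main(String[] args) {
        // Stand-ins for SearchHit sources: each hit is a field -> value map.
        List<Map<String, Object>> hits = List.of(
                Map.of("name", "ABC", "favorite_cars", List.of("ferrari", "toyota")),
                Map.of("name", "ABC", "favorite_cars", List.of("ferrari", "toyota")),
                Map.of("name", "XYZ", "favorite_cars", List.of("toyota"))
        );

        // Keyed by "name": a later hit with the same name overwrites the
        // earlier one, so exactly one hit per distinct name survives.
        Map<String, Map<String, Object>> distinct = new HashMap<>();
        for (Map<String, Object> hit : hits) {
            Object name = hit.get("name");
            if (name != null) {
                distinct.put(name.toString(), hit);
            }
        }
        System.out.println(distinct.size()); // two distinct names remain
    }
}
```

Note the trade-off: this deduplicates only within the page of hits you fetched, whereas the aggregation approach deduplicates across the whole index.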

You can also create an object mapping and use that in place of SearchHit.

I hope this solves your problem. Please forgive me if there are any errors in the code. This is just a pseudo-ish code to make you understand how you can solve your problem.

Thanks


@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your index, ordered by occurrence.

{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "Cars": {
      "terms": {
        "field": "favorite_cars",
        "order": { "_count": "desc" },
        "size": 10000
      }
    }
  }
}

It is also worth noting that you will want your "favorite_cars" field to be not analyzed, so that you get back "McLaren F1" as a single term instead of the separate terms "mclaren" and "f1".

"favorite_cars": {
  "type": "string",
  "index": "not_analyzed"
}
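To see why the mapping matters, here is a rough sketch of what the standard analyzer does to a multi-word value compared with a not_analyzed field. The tokenization below (split on whitespace, lowercase) is a simplification of the real analyzer:

```java
import java.util.*;

public class AnalyzedVsNotAnalyzed {
    public static void main(String[] args) {
        String value = "McLaren F1";

        // Analyzed (simplified standard analyzer): split on whitespace and
        // lowercase, so the aggregation sees two separate terms.
        List<String> analyzedTerms = Arrays.stream(value.split("\\s+"))
                .map(String::toLowerCase)
                .toList();
        System.out.println(analyzedTerms);

        // not_analyzed: the whole value is indexed as a single term, so the
        // terms aggregation returns one bucket for the full string.
        List<String> notAnalyzedTerms = List.of(value);
        System.out.println(notAnalyzedTerms);
    }
}
```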