ElasticSearch returning only documents with distinct value
You can eliminate duplicates using aggregations. A terms aggregation groups the results by one field, e.g. name, provides a count of the occurrences of each value of that field, and sorts the buckets by this count (descending).
{
  "query": {
    "fuzzy_like_this_field": {
      "favorite_cars": {
        "like_text": "toyota",
        "max_query_terms": 12
      }
    }
  },
  "aggs": {
    "grouped_by_name": {
      "terms": {
        "field": "name",
        "size": 0
      }
    }
  }
}
In addition to the hits, the result will also contain the buckets, with the unique values in key and the count in doc_count:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
  "hits" : {
    "total" : 2,
    "max_score" : 0.19178301,
    "hits" : [
      {
        "_index" : "pru",
        "_type" : "pru",
        "_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
        "_score" : 0.19178301,
        "_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
      },
      {
        "_index" : "pru",
        "_type" : "pru",
        "_id" : "IdEbAcI6TM6oCVxCI_3fug",
        "_score" : 0.19178301,
        "_source":{"name":"ABC","favorite_cars":["ferrari","toyota"]}
      }
    ]
  },
  "aggregations" : {
    "grouped_by_name" : {
      "buckets" : [
        { "key" : "abc", "doc_count" : 2 }
      ]
    }
  }
}
Note that using aggregations will be costly because of duplicate elimination and result sorting.
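To make the bucket semantics concrete, here is a minimal plain-Java sketch of what a terms-style grouping does: count each value, then sort by count, descending. It does not call Elasticsearch; the class name TermBuckets and the method bucketize are hypothetical, and the way ties are ordered here (first-seen order) is a choice of this sketch, not a statement about Elasticsearch's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of terms-style bucketing; not Elasticsearch code. */
final class TermBuckets {

    /** Counts each distinct value and returns (value, count) pairs, highest count first. */
    static List<Map.Entry<String, Long>> bucketize(List<String> values) {
        // LinkedHashMap remembers first-seen order, which this sketch uses for ties.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> buckets = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep their first-seen order.
        buckets.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return buckets;
    }

    public static void main(String[] args) {
        // Two documents share the (already lowercased) name "abc".
        List<String> names = Arrays.asList("abc", "xyz", "abc");
        for (Map.Entry<String, Long> bucket : bucketize(names)) {
            System.out.println(bucket.getKey() + " -> " + bucket.getValue());
        }
    }
}
```

Each entry plays the role of one bucket: the entry key corresponds to key and the entry value to doc_count.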
ElasticSearch doesn't provide any query that returns distinct documents based on a field value.
Ideally, you should have indexed the same document with the same type and id, since ElasticSearch uses these two values to give a document its unique _uid. A unique id is important not only for detecting duplicate documents but also for updating the same document when it is modified, instead of inserting a new one. For more information about indexing documents you can read this.
But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value yourself. In fact, this gives you more flexibility to perform custom operations on the responses that you get from ES.
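import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;

// "client" is assumed to be an already-initialized ElasticSearch Client.
SearchResponse response = client.prepareSearch().execute().actionGet();
SearchHits hits = response.getHits();

// Keyed by the "name" field; a later hit with the same name replaces the earlier one.
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
for (SearchHit searchHit : hits) {
    Map<String, Object> source = searchHit.getSource();
    if (source.get("name") != null) {
        // Store the hit itself (not its source map) so the value matches the map's type.
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}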
So you end up with a map of unique SearchHit objects, one per name.
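The loop above keeps whichever hit comes last for each name, because put overwrites the previous entry. If you would rather keep the first hit per value and preserve the order of the results, a variant along these lines may help. This is only a sketch: it uses plain Map<String, Object> objects as stand-ins for the hit sources so that it runs without an ElasticSearch dependency, and the names DistinctByField, distinctBy and doc are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Client-side de-duplication by one field, keeping the first occurrence. */
final class DistinctByField {

    /** Returns the first source seen for each distinct value of the given field. */
    static List<Map<String, Object>> distinctBy(List<Map<String, Object>> sources, String field) {
        // LinkedHashMap preserves the order in which distinct values first appear.
        Map<String, Map<String, Object>> firstSeen = new LinkedHashMap<>();
        for (Map<String, Object> source : sources) {
            Object value = source.get(field);
            if (value != null) {
                // putIfAbsent keeps the earlier entry, unlike put.
                firstSeen.putIfAbsent(value.toString(), source);
            }
        }
        return new ArrayList<>(firstSeen.values());
    }

    /** Builds a small stand-in for a hit's source map (illustrative data only). */
    static Map<String, Object> doc(String name, int id) {
        Map<String, Object> source = new HashMap<>();
        source.put("name", name);
        source.put("id", id);
        return source;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> sources =
                Arrays.asList(doc("ABC", 1), doc("ABC", 2), doc("XYZ", 3));
        for (Map<String, Object> source : distinctBy(sources, "name")) {
            System.out.println(source.get("name") + " (id " + source.get("id") + ")");
        }
    }
}
```

Sources without the field are skipped, mirroring the null check in the loop above.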
You can also create an object mapping and use that in place of SearchHit.
I hope this solves your problem. Please forgive me if there are any errors in the code. It is just pseudo-ish code to show how you can solve your problem.
Thanks
@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your objects, ordered by occurrence.
{
  "query": {
    "match_all": { }
  },
  "size": 0,
  "aggs": {
    "Cars": {
      "terms": {
        "field": "favorite_cars",
        "order": { "_count": "desc" },
        "size": 10000
      }
    }
  }
}
It is also worth noting that you will want your "favorite_cars" field to be not analyzed, so that you get "McLaren F1" instead of "McLaren", "F1".
"favorite_cars": {
  "type": "string",
  "index": "not_analyzed"
}
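To see why the analysis setting matters for bucket keys, the following plain-Java sketch contrasts the two cases. It is a deliberately simplified stand-in: it lowercases the value and splits it on non-alphanumeric characters, which only approximates what an analyzer does, and it is not ElasticSearch's implementation. The names AnalyzedVsExact, analyzedTerms and exactTerms are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Rough illustration of analyzed vs. not-analyzed terms; not ElasticSearch code. */
final class AnalyzedVsExact {

    /** Simplified "analyzed" view: lowercase, then split on anything that is not a letter or digit. */
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** "Not analyzed" view: the whole value is kept as a single term. */
    static List<String> exactTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String car = "McLaren F1";
        System.out.println("analyzed:     " + analyzedTerms(car));
        System.out.println("not analyzed: " + exactTerms(car));
    }
}
```

An aggregation over the first form would produce one bucket per token, while the second form yields a single bucket for the full value.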