
ElasticSearch returning only documents with distinct value


You can eliminate duplicates using aggregations. With a terms aggregation the results are grouped by one field, e.g. name; it also provides a count of the occurrences of each value of the field, and sorts the buckets by this count (descending).

{
  "query": {
    "fuzzy_like_this_field": {
      "favorite_cars": {
        "like_text": "toyota",
        "max_query_terms": 12
      }
    }
  },
  "aggs": {
    "grouped_by_name": {
      "terms": {
        "field": "name",
        "size": 0
      }
    }
  }
}

In addition to the hits, the result will also contain the buckets with the unique values in key and with the count in doc_count:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "vGkoVV5cR8SN3lvbWzLaFQ",
      "_score" : 0.19178301,
      "_source" : {"name":"ABC","favorite_cars":["ferrari","toyota"]}
    }, {
      "_index" : "pru",
      "_type" : "pru",
      "_id" : "IdEbAcI6TM6oCVxCI_3fug",
      "_score" : 0.19178301,
      "_source" : {"name":"ABC","favorite_cars":["ferrari","toyota"]}
    } ]
  },
  "aggregations" : {
    "grouped_by_name" : {
      "buckets" : [ {
        "key" : "abc",
        "doc_count" : 2
      } ]
    }
  }
}

Note that using aggregations is costly because of the duplicate elimination and result sorting involved.
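Conceptually, the terms aggregation groups field values, counts them, and sorts the buckets by count descending. A rough plain-Java sketch of that behavior, using hypothetical data rather than the Elasticsearch client API:

```java
import java.util.*;
import java.util.stream.*;

public class TermsAggSketch {
    public static void main(String[] args) {
        // Hypothetical "name" values from four indexed documents.
        List<String> names = List.of("ABC", "abc", "XYZ", "ABC");

        // Analyzed fields are lowercased before aggregation; then group
        // by value and count occurrences, as the terms aggregation does.
        Map<String, Long> buckets = names.stream()
                .map(String::toLowerCase)
                .collect(Collectors.groupingBy(n -> n, Collectors.counting()));

        // Buckets come back ordered by doc_count, descending.
        buckets.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

This also shows why the bucket key in the response above is "abc" even though the documents contain "ABC": the analyzed field is lowercased before bucketing.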


ElasticSearch doesn't provide any query by which you can get distinct documents based on a field value.

Ideally you should have indexed the same document with the same type and id, since these two things are used by ElasticSearch to give a document its unique _uid. A unique id is important not only for detecting duplicate documents but also for updating the same document on modification instead of inserting a new one. For more information about indexing documents you can read this.

But there is definitely a workaround for your problem. Since you are using the Java API client, you can remove duplicate documents based on a field value yourself. In fact, this gives you more flexibility to perform custom operations on the responses you get from ES.

SearchResponse response = client.prepareSearch().execute().actionGet();
SearchHits hits = response.getHits();
Iterator<SearchHit> iterator = hits.iterator();
Map<String, SearchHit> distinctObjects = new HashMap<String, SearchHit>();
while (iterator.hasNext()) {
    SearchHit searchHit = iterator.next();
    Map<String, Object> source = searchHit.getSource();
    // Keyed by the "name" field: later hits with the same name overwrite
    // earlier ones, so only one hit per distinct name remains.
    if (source.get("name") != null) {
        distinctObjects.put(source.get("name").toString(), searchHit);
    }
}

So you will have a map of unique SearchHit objects, keyed by the name field.
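The same map-based deduplication can be sketched standalone, with plain maps standing in for the SearchHit sources (the data here is hypothetical):

```java
import java.util.*;

public class DedupByField {
    public static void main(String[] args) {
        // Stand-ins for SearchHit sources: each hit is a field -> value map.
        List<Map<String, Object>> hits = List.of(
                Map.of("name", "ABC", "favorite_cars", List.of("ferrari", "toyota")),
                Map.of("name", "ABC", "favorite_cars", List.of("ferrari", "toyota")),
                Map.of("name", "XYZ", "favorite_cars", List.of("toyota"))
        );

        // Keyed by "name": a later hit with the same name overwrites the
        // earlier one, so exactly one hit per distinct name survives.
        Map<String, Map<String, Object>> distinct = new HashMap<>();
        for (Map<String, Object> hit : hits) {
            Object name = hit.get("name");
            if (name != null) {
                distinct.put(name.toString(), hit);
            }
        }
        System.out.println(distinct.size()); // two distinct names remain
    }
}
```

Note the trade-off: this deduplicates only within the page of hits you fetched, whereas the aggregation approach deduplicates across the whole index.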

You can also create an object mapping and use that in place of SearchHit.

I hope this solves your problem. Please forgive me if there are any errors in the code. This is just a pseudo-ish code to make you understand how you can solve your problem.

Thanks


@JRL is almost correct. You will need an aggregation in your query. This will get you a list of the top 10000 "favorite_cars" values in your index, ordered by occurrence.

{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "Cars": {
      "terms": {
        "field": "favorite_cars",
        "order": { "_count": "desc" },
        "size": 10000
      }
    }
  }
}

It is also worth noting that you will want your "favorite_cars" field to be not analyzed, so that you get back "McLaren F1" as a single term instead of the separate terms "mclaren" and "f1".

"favorite_cars": {
  "type": "string",
  "index": "not_analyzed"
}
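To see why the mapping matters, here is a rough sketch of what the standard analyzer does to a multi-word value compared with a not_analyzed field. The tokenization below (split on whitespace, lowercase) is a simplification of the real analyzer:

```java
import java.util.*;

public class AnalyzedVsNotAnalyzed {
    public static void main(String[] args) {
        String value = "McLaren F1";

        // Analyzed (simplified standard analyzer): split on whitespace and
        // lowercase, so the aggregation sees two separate terms.
        List<String> analyzedTerms = Arrays.stream(value.split("\\s+"))
                .map(String::toLowerCase)
                .toList();
        System.out.println(analyzedTerms);

        // not_analyzed: the whole value is indexed as a single term, so the
        // terms aggregation returns one bucket for the full string.
        List<String> notAnalyzedTerms = List.of(value);
        System.out.println(notAnalyzedTerms);
    }
}
```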