Finding duplicates in Elasticsearch


Your last approach seems to be the best one, and you can update your Elasticsearch settings as follows:

indices.breaker.request.limit: "75%"
indices.breaker.total.limit: "85%"

I chose 75% because the default is 60%, which works out to 5.9gb on your Elasticsearch heap, while your request grows to ~6.3gb, which is around 71.1% based on your log:

circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]

Finally, indices.breaker.total.limit must be greater than indices.breaker.fielddata.limit, according to the Elasticsearch documentation.
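Both breaker limits are dynamic settings, so as an alternative to editing elasticsearch.yml and restarting, here is a minimal sketch of applying them at runtime through the cluster settings API (use "transient" instead of "persistent" if you don't want them kept across restarts):

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "75%",
    "indices.breaker.total.limit": "85%"
  }
}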


An idea that might work in a Logstash scenario is using copy fields:

Copy all combinations to separate fields and concatenate them:

mutate {
  add_field => {
    "new_field" => "%{oldfield1} %{oldfield2}"
  }
}

Then aggregate over the new field.
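For that aggregation step, a terms aggregation with min_doc_count: 2 returns only the keys that occur more than once, i.e. the duplicates (a sketch; the index pattern logstash-* and the new_field.keyword mapping are assumptions about your setup):

GET logstash-*/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "new_field.keyword",
        "min_doc_count": 2,
        "size": 1000
      }
    }
  }
}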

Have a look here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html

I don't know whether add_field supports arrays (other mutate options do, if you look at the documentation). If it does not, you could add several new fields and use merge to end up with just one field, as sketched below.
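A sketch of that fallback (the field names combo_ab, combo_ac, fieldA, fieldB, fieldC are made up for illustration); mutate's merge converts string fields to arrays automatically:

mutate {
  add_field => {
    "combo_ab" => "%{fieldA} %{fieldB}"
    "combo_ac" => "%{fieldA} %{fieldC}"
  }
}
mutate {
  # append combo_ac onto combo_ab, so combo_ab becomes
  # an array holding both combinations
  merge => { "combo_ab" => "combo_ac" }
}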

If you can do this at index time, it would certainly be better.

You only need the combinations (A_B), not all permutations (A_B and B_A).
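Since plain concatenation is order-sensitive, one way to guarantee that is to sort the values before joining them, so A_B and B_A collapse to the same key. A sketch using the ruby filter (field names are assumptions):

ruby {
  # sort the two values so the concatenated key is order-independent
  code => "
    vals = [event.get('oldfield1'), event.get('oldfield2')].sort
    event.set('new_field', vals.join('_'))
  "
}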