Finding duplicates in Elasticsearch
Your last approach seems to be the best one, and you can update your Elasticsearch settings as follows:
indices.breaker.request.limit: "75%"
indices.breaker.total.limit: "85%"
I have chosen 75% because the default is 60%, which is 5.9gb on your Elasticsearch node, while your query is becoming ~6.3gb, which is around 71.1%, based on your log:
circuit_breaking_exception: [request] Data too large, data for [<reused_arrays>] would be larger than limit of [6348236390/5.9gb]
And finally, indices.breaker.total.limit must be greater than indices.breaker.fielddata.limit, according to the Elasticsearch documentation.
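These breaker limits are dynamic settings, so if restarting nodes is inconvenient you can also apply them at runtime through the cluster settings API. A sketch (endpoint and persistence level are up to you):

```json
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "75%",
    "indices.breaker.total.limit": "85%"
  }
}
```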
An idea that might work in a Logstash scenario is using copy fields: copy all combinations to separate fields and concatenate them:
mutate { add_field => { "new_field" => "%{oldfield1} %{oldfield2}" }}
Then aggregate over the new field.
Have a look here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html
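Once new_field is indexed, duplicates can be surfaced with a terms aggregation that only returns buckets containing more than one document. A sketch, assuming the index name is a placeholder and the field has a keyword sub-field:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_keys": {
      "terms": {
        "field": "new_field.keyword",
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
```

Every bucket that comes back names a combination value shared by two or more documents.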
I don't know if add_field supports arrays (others do, if you look at the documentation). If it does not, you could try to add several new fields and use merge to end up with just one field.
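A minimal sketch of that fallback, assuming two combinations and the mutate filter's merge option (all field names here are made up for illustration):

```
filter {
  mutate {
    add_field => {
      "combo_ab" => "%{fieldA} %{fieldB}"
      "combo_ac" => "%{fieldA} %{fieldC}"
    }
  }
  # merge appends the values of combo_ac to combo_ab,
  # leaving a single array field with all combinations
  mutate {
    merge => { "combo_ab" => "combo_ac" }
  }
  mutate { remove_field => [ "combo_ac" ] }
}
```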
If you can do this at index time, it would certainly be better.
Note that you only need the combinations (A_B), not all permutations (A_B, B_A).
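Since only combinations matter, sorting the values before concatenating makes A_B and B_A collapse into the same key. A minimal Python sketch of that normalization (the function name and separator are my own choices):

```python
def combo_key(*values):
    """Build an order-independent key, so (A, B) and (B, A) produce the same string."""
    return "_".join(sorted(str(v) for v in values))

# The same pair in either order maps to one key:
print(combo_key("B", "A"))  # A_B
print(combo_key("A", "B"))  # A_B
```

Applying this before indexing halves the number of distinct keys the aggregation has to handle.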