Efficiently Computing Significant Terms in SQL

sql elasticsearch query-optimization aggregation significant-terms

I doubt a SQL implementation will be faster. The values for C and T are maintained ahead of time by Lucene. S is a simple count derived from the query results, and I is looked up using O(1) data structures. The main cost is the many T lookups, one for each of the terms observed in the chosen field. Using min_doc_count typically reduces the number of these lookups drastically.
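For illustration, here is a rough sketch of where min_doc_count would go in a significant_terms request body. The helper name, the field name "tags", the match query and the threshold of 10 are all made up for the example; only the overall request shape and the min_doc_count parameter come from Elasticsearch itself.

```python
import json


def significant_terms_request(field, query, min_doc_count=10):
    """Build a search body for a significant_terms aggregation.

    Hypothetical helper: `field`, `query` and the default threshold are
    placeholders chosen for this example, not values from the question.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the hits
        "query": query,  # the query defines the foreground set
        "aggs": {
            "significant": {
                "significant_terms": {
                    "field": field,
                    # Skip terms that appear in fewer foreground documents
                    # than this, which avoids a background lookup for them.
                    "min_doc_count": min_doc_count,
                }
            }
        },
    }


if __name__ == "__main__":
    body = significant_terms_request("tags", {"match": {"title": "fraud"}})
    print(json.dumps(body, indent=2))
```

A higher threshold trades recall on rare terms for fewer lookups, so the right value depends on how large the foreground set is.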

For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements

Have you looked into using doc values to manage Elasticsearch memory better? See https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale


An efficient solution is possible when the foreground set is small enough that you can afford to process every document in it.

1. Collect the set {X_k} of all terms occurring in the foreground set for the chosen field, together with their frequencies {f_k} in the foreground set.
2. For each X_k, calculate its significance as (f_k - F_k) * (f_k / F_k), where F_k = T_k / C is the frequency of X_k in the background set.
3. Select the terms with the highest significance values.
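The steps above could be sketched in SQL roughly as follows, run here through Python's sqlite3 module so the example is self-contained. The schema is an assumption, not something from the question: a hypothetical doc_terms(doc_id, term) table with one row per document and term of the chosen field, and a hypothetical foreground(doc_id) table holding the IDs of the foreground documents. The sketch also assumes f_k is a relative frequency (foreground documents containing the term divided by the foreground size), so that it is on the same scale as F_k = T_k / C.

```python
import sqlite3

# Assumed schema (hypothetical): doc_terms(doc_id, term), foreground(doc_id).
SIGNIFICANT_TERMS_SQL = """
WITH fg AS (              -- step 1: terms of the foreground set and their counts
    SELECT dt.term, COUNT(DISTINCT dt.doc_id) AS fg_count
    FROM doc_terms dt
    JOIN foreground f ON f.doc_id = dt.doc_id
    GROUP BY dt.term
),
bg AS (                   -- T_k, looked up only for terms seen in the foreground
    SELECT term, COUNT(DISTINCT doc_id) AS bg_count
    FROM doc_terms
    WHERE term IN (SELECT term FROM fg)
    GROUP BY term
),
sizes AS (                -- foreground size and C (documents with at least one term)
    SELECT (SELECT COUNT(*) FROM foreground) AS fg_size,
           (SELECT COUNT(DISTINCT doc_id) FROM doc_terms) AS bg_size
)
SELECT fg.term,           -- step 2: (f_k - F_k) * (f_k / F_k)
       (1.0 * fg.fg_count / sizes.fg_size - 1.0 * bg.bg_count / sizes.bg_size)
       * ((1.0 * fg.fg_count / sizes.fg_size) / (1.0 * bg.bg_count / sizes.bg_size))
       AS significance
FROM fg
JOIN bg ON bg.term = fg.term
CROSS JOIN sizes
ORDER BY significance DESC, fg.term   -- step 3: highest significance first
LIMIT ?
"""


def significant_terms(conn, limit=10):
    """Return (term, significance) rows for the most significant terms."""
    return conn.execute(SIGNIFICANT_TERMS_SQL, (limit,)).fetchall()


def build_demo_db():
    """Create a tiny in-memory data set, invented purely for the example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE doc_terms (doc_id INTEGER, term TEXT);
        CREATE TABLE foreground (doc_id INTEGER PRIMARY KEY);
    """)
    conn.executemany(
        "INSERT INTO doc_terms VALUES (?, ?)",
        [(1, "apple"), (1, "fraud"), (2, "apple"), (2, "fraud"),
         (3, "apple"), (4, "apple"), (5, "apple"), (5, "banana"),
         (6, "banana")],
    )
    conn.executemany("INSERT INTO foreground VALUES (?)", [(1,), (2,), (3,)])
    return conn


if __name__ == "__main__":
    for term, score in significant_terms(build_demo_db(), limit=5):
        print(term, round(score, 4))
```

In the demo data "apple" appears in every foreground document but is also common in the background, so it scores lower than "fraud", which is rarer overall. On a real data set the bg step is the expensive part, which is why restricting it to terms seen in the foreground matters.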

However, due to the simplicity of this approach, I wonder if ElasticSearch already contains that optimization. If it doesn't - then it very soon will!
