
Efficiently Computing Significant Terms in SQL


I doubt a SQL implementation will be faster. The values for C and T are maintained ahead of time by Lucene, S is a simple count derived from the query results, and I is looked up using O(1) data structures. The main cost is the many T lookups, one for each term observed in the chosen field. Setting min_doc_count typically helps drastically reduce the number of these lookups.
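The cost structure described above can be sketched with the same variable names (I, S, T, C). This is a hypothetical illustration of a JLH-style score and of how a min_doc_count cutoff avoids T lookups for rare terms, not Elasticsearch's actual code:

```python
def jlh_score(I, S, T, C, min_doc_count=3):
    """Significance of one term, using the variables from the comment above:
    I = foreground documents containing the term (O(1) lookup per term)
    S = size of the foreground set (a simple count over the query results)
    T = background documents containing the term (maintained by the index)
    C = total documents in the background set (maintained by the index)
    """
    if I < min_doc_count:
        # Pruning here skips the (comparatively expensive) T lookup entirely
        # for terms that appear in too few foreground documents.
        return 0.0
    fg, bg = I / S, T / C
    # JLH-style score: absolute change in frequency times relative change.
    return (fg - bg) * (fg / bg) if bg > 0 else 0.0
```

In a real collector the `if I < min_doc_count` branch would run before T is ever fetched, which is why raising min_doc_count shrinks the number of lookups so effectively.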

For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements …

Have you looked into using doc values to manage elasticsearch memory better? See https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale


An efficient solution is possible when the foreground set is small enough that you can afford to process every document in it.

  1. Collect the set {Xk} of all terms occurring in the chosen field across the foreground set, together with their foreground frequencies {fk}.

  2. For each Xk, calculate its significance as (fk - Fk) * (fk / Fk), where Fk = Tk/C is the frequency of Xk in the background set.

  3. Select the terms with the highest significance values.
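The three steps above can be sketched in a few lines. This is a minimal illustration, assuming the background document counts {Tk} and the background size C are precomputed (e.g. by the index); the function name and argument layout are my own:

```python
from collections import Counter

def significant_terms(foreground_docs, background_freq, C, top_n=10):
    """Rank terms over-represented in the foreground set.

    foreground_docs: iterable of term lists, one per foreground document
    background_freq: dict mapping term -> Tk, the term's document count
                     in the background set (assumed precomputed)
    C: total number of documents in the background set
    """
    # Step 1: collect the terms and their foreground frequencies.
    S = 0
    counts = Counter()  # term -> number of foreground docs containing it
    for doc in foreground_docs:
        S += 1
        counts.update(set(doc))  # count each term once per document

    # Step 2: score each term as (fk - Fk) * (fk / Fk).
    scored = []
    for term, n in counts.items():
        fk = n / S                             # foreground frequency
        Fk = background_freq.get(term, 0) / C  # background frequency Tk/C
        if Fk == 0:
            continue  # term absent from background; skip it here
        scored.append((term, (fk - Fk) * (fk / Fk)))

    # Step 3: highest significance first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

For example, a term appearing in 3 of 4 foreground documents but only 10 of 1000 background documents scores far above a common word like "the", which is frequent in both sets.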

However, given the simplicity of this approach, I wonder whether ElasticSearch already contains this optimization. If it doesn't, it very soon will!