Efficiently Computing Significant Terms in SQL
I doubt a SQL impl will be faster.The values for C and T are maintained ahead of time by Lucene.S is a simple count derived from the query results and I is looked up using O(1) data structures. The main cost are the many T lookups for each of the terms observed in the chosen field. Using min_doc_count typically helps drastically reduce the number of these lookups.
For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements
Have you looked into using doc values to manage elasticsearch memory better? See https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale
An efficient solution is possible for the case when the foreground set is small enough. Then you can afford processing all documents in the foreground set.
Collect the set {Xk} of all terms occurring in the foreground set for the chosen field, as well as their frequencies {fk} in the foreground set.
For each Xk
- Calculate the significance of Xk as (fk - Fk) * (fk / Fk), where Fk=Tk/C is the frequency of Xk in the background set.
Select the terms with the highest significance values.
However, due to the simplicity of this approach, I wonder if ElasticSearch already contains that optimization. If it doesn't - then it very soon will!