Extracting most important words from Elasticsearch index, using Node JS client


Tf-idf is typically used to measure the similarity of documents (using cosine similarity, Euclidean distance, etc.).

Tf, or term frequency, indicates how often a word occurs in a document. The higher the frequency of the word, the higher its importance within that document.

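Idf, or inverse document frequency, is inversely related to the number of documents (in the input collection) that contain the word; a common form is log(N / df), where N is the number of documents and df is the number of documents containing the word. The rarer the word, the higher its importance.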

If we use only tf to build the document vector, we are prone to spam, because common words (e.g. pronouns, conjunctions) gain too much importance. Combining tf with idf gives a better measure of the real significance of a word. In other words, to rank the words of a document by significance, do not calculate just the tf of each word; compute tf-idf over the entire input collection and rank by the tf-idf value.
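
The ranking described above can be sketched in plain Node.js as follows. This is a minimal illustration, not a production implementation: the tokenizer is a naive assumption, the sample documents are invented, and tf is taken as count / document length with idf = ln(N / df), which is only one of several common variants.

```javascript
// Naive tokenizer (an assumption): lowercase and split on non-alphanumerics.
// A real analyzer would also handle stemming, stop words, etc.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Returns, for each document, its terms sorted by descending tf-idf.
function tfIdf(docs) {
  const tokenized = docs.map(tokenize);

  // df: term -> number of documents that contain the term at least once
  const df = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  const n = docs.length;
  return tokenized.map((tokens) => {
    // Raw term counts for this document
    const counts = new Map();
    for (const term of tokens) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }

    const scores = [];
    for (const [term, count] of counts) {
      const tf = count / tokens.length;
      const idf = Math.log(n / df.get(term));
      scores.push({ term, score: tf * idf });
    }
    // Highest score first: the most significant words of this document
    return scores.sort((a, b) => b.score - a.score);
  });
}

// Invented sample collection
const docs = [
  "the cat sat on the mat",
  "the dog sat on the log",
  "the cat chased the dog",
];
const ranked = tfIdf(docs);
console.log(ranked[0]);
```

In this sample, "the" appears in every document, so its idf (and therefore its tf-idf) is zero even though it is the most frequent word, while "mat", which occurs in only one document, ranks first for the first document.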

Have a look at this sample Python solution, which calculates tf-idf values for a list of JSON tweets and finds similar tweets.

Github Sample


Elasticsearch provides a very specific aggregation that allows you to extract "significant keywords" for a subset of your index [1].

To work out what is significant, you need a foreground (the subset of documents you want to analyse) and a background (the entire corpus).
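
As a rough sketch, a request using the `significant_terms` aggregation from [1] could be built like this. The index name, field name, and foreground query are hypothetical, and the helper `buildSignificantTermsRequest` is introduced here only for illustration. The query selects the foreground set; by default the background is the whole index.

```javascript
// Builds a search request whose foreground set is defined by `query`.
// The background set defaults to all documents in the index.
function buildSignificantTermsRequest(index, field, query, size = 10) {
  return {
    index,
    body: {
      size: 0, // only the aggregation is needed, not the matching hits
      query,
      aggs: {
        // "significant_keywords" is an arbitrary aggregation name
        significant_keywords: {
          significant_terms: { field, size },
        },
      },
    },
  };
}

const request = buildSignificantTermsRequest(
  "articles",                        // hypothetical index name
  "tags",                            // hypothetical keyword field
  { match: { category: "sports" } }  // hypothetical foreground query
);
console.log(JSON.stringify(request, null, 2));

// Assumed usage with an already configured Node.js client instance:
//   const response = await client.search(request);
// The exact call and response shape depend on the client package and version
// (some versions wrap the result in `body`, newer ones take `query`/`aggs`
// at the top level). The significant terms are expected under
// aggregations.significant_keywords.buckets.
```

Keeping the request as a plain object makes it easy to inspect or log before sending it, whichever client version is in use.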

As you may realize, to identify a term as significant you need to compare how often it appears in your corpus with how often it appears in something else (for example, a generic corpus). You may find an archive that contains some sort of general IDF score for terms (Reuters corpus, Brown corpus, Wikipedia, etc.). Then you can use:

Foreground document set -> your corpus
Background document set -> generic corpus
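
One way to sketch this idea outside Elasticsearch is to weight term frequencies from your own text by IDF values taken from a generic corpus. Everything specific below is an assumption for illustration: the IDF numbers are made up rather than taken from any real archive, and treating unseen terms as rare (the `DEFAULT_IDF` fallback) is just one possible choice.

```javascript
// Hypothetical IDF values from a generic background corpus (made-up numbers).
const genericIdf = new Map([
  ["the", 0.01],
  ["index", 2.3],
  ["shard", 6.1],
]);

// Fallback for terms missing from the background table (assumption:
// a term never seen in the generic corpus is treated as rare).
const DEFAULT_IDF = 7.0;

// Ranks the terms of the foreground text by tf * background idf.
function rankAgainstBackground(foregroundText, idfTable, defaultIdf) {
  const tokens = foregroundText
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);

  const counts = new Map();
  for (const term of tokens) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }

  return [...counts]
    .map(([term, count]) => {
      const idf = idfTable.has(term) ? idfTable.get(term) : defaultIdf;
      return { term, score: (count / tokens.length) * idf };
    })
    .sort((a, b) => b.score - a.score);
}

const rankedTerms = rankAgainstBackground(
  "the shard holds the index the shard", // invented foreground text
  genericIdf,
  DEFAULT_IDF
);
console.log(rankedTerms);
```

With these invented numbers, "shard" ranks first and "the" ranks last, even though "the" is the most frequent token in the foreground text.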

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html