
Interpreting the sum of TF-IDF scores of words across documents


At the corpus level, one way to interpret a term's TF-IDF is as the highest TF-IDF score that term reaches in any single document of the corpus.

Find the top words in corpus_tfidf:

    topWords = {}
    for doc in corpus_tfidf:
        for iWord, tf_idf in doc:
            if iWord not in topWords:
                topWords[iWord] = 0
            if tf_idf > topWords[iWord]:
                topWords[iWord] = tf_idf

    for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
        print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
        if i == 6: break
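For context, here is a minimal sketch (mine, not part of the original question) of how a corpus_tfidf and a matching dictionary are typically produced with gensim, so that the (word_index, tf_idf) pairs above make sense; the tokenized texts are a made-up toy example:

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    # hypothetical tokenized documents
    texts = [["human", "interface", "computer"],
             ["survey", "user", "computer", "system"],
             ["graph", "trees", "minors", "survey"]]

    dictionary = Dictionary(texts)                        # maps word index -> word
    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]                          # each doc is a list of (word_index, tf_idf) pairs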

Output comparison chart:
NOTE: I couldn't use gensim to create a dictionary matching corpus_tfidf, so only word indices can be displayed.

    Question tfidf_saliency    topWords(corpus_tfidf)   Other TF-IDF implementation
    ---------------------------------------------------------------------------
    1: Word(7)   0.121         1: Word(13)    0.640     1: paths         0.376019
    2: Word(8)   0.111         2: Word(27)    0.632     2: intersection  0.376019
    3: Word(26)  0.108         3: Word(28)    0.632     3: survey        0.366204
    4: Word(29)  0.100         4: Word(8)     0.628     4: minors        0.366204
    5: Word(9)   0.090         5: Word(29)    0.628     5: binary        0.300815
    6: Word(14)  0.087         6: Word(11)    0.544     6: generation    0.300815

The calculation of TF-IDF always takes the whole corpus into account.

Tested with Python 3.4.2.


This is a great discussion. Thanks for starting this thread. The idea of including document length, suggested by @avip, seems interesting. I will have to experiment and check the results. In the meantime, let me try asking the question a little differently: what are we trying to interpret when querying for TF-IDF relevance scores?

  1. Possibly trying to understand the word relevance at the document level
  2. Possibly trying to understand the word relevance per class
  3. Possibly trying to understand the word relevance overall (in the whole corpus)

    # corpus = 6 documents of length 3 (3 features each)
    counts = [[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]]

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer

    transformer = TfidfTransformer(smooth_idf=False)
    tfidf = transformer.fit_transform(counts)
    print(tfidf.toarray())

    # lambdas for basic stat computation (sum and mean over documents)
    summarizer_default = lambda x: np.sum(x, axis=0)
    summarizer_mean = lambda x: np.mean(x, axis=0)

    print(summarizer_default(tfidf))
    print(summarizer_mean(tfidf))

Result:

    # Result post computing TF-IDF relevance scores
    array([[ 0.81940995,  0.        ,  0.57320793],
           [ 1.        ,  0.        ,  0.        ],
           [ 1.        ,  0.        ,  0.        ],
           [ 1.        ,  0.        ,  0.        ],
           [ 0.47330339,  0.88089948,  0.        ],
           [ 0.58149261,  0.        ,  0.81355169]])

    # Result post aggregation (Sum, Mean)
    [[ 4.87420595  0.88089948  1.38675962]]
    [[ 0.81236766  0.14681658  0.2311266 ]]

If we observe closely, we realize that feature1, which occurs in every document, is not ignored completely, because the sklearn implementation uses idf = log [ n / df(d, t) ] + 1. The +1 is added so that an important word which just happens to occur in all documents is not zeroed out, e.g. the word 'bike' occurring very frequently when classifying a particular document as 'motorcycle' (20_newsgroup dataset).
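As a quick check of that formula (my own sketch, reusing the toy counts from above), the idf values sklearn reports with smooth_idf=False match log(n / df) + 1, and the feature present in all six documents still gets a weight of 1 rather than 0:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfTransformer

    counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 0, 0],
                       [4, 0, 0],
                       [3, 2, 0],
                       [3, 0, 2]])

    transformer = TfidfTransformer(smooth_idf=False).fit(counts)

    n = counts.shape[0]
    df = (counts > 0).sum(axis=0)          # document frequency of each feature: [6, 1, 3]
    manual_idf = np.log(n / df) + 1

    print(transformer.idf_)                # sklearn's idf values
    print(manual_idf)                      # same values; feature1 (df = 6) keeps idf = 1.0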

  1. Now, in regard to the first two questions, one is trying to interpret and understand the top common features that might be occurring in the documents. In that case, aggregating in some form over all possible occurrences of a word in a doc is not taking anything away, even mathematically. IMO such a query is very useful for exploring the dataset and helping to understand what the dataset is about. The same logic can be applied to vectorizing with hashing as well.

    relevance_score = mean(tf(t, d) * idf(t, d))
                    = mean( (bias + initial_wt * F(t, d) / max{F(t', d)}) * (log(N / df(d, t)) + 1) )

  2. Question 3 is very important, as it may well contribute to which features are selected for building a predictive model. Using TF-IDF scores independently for feature selection can be misleading at multiple levels. Adopting a more theoretical statistical test such as 'chi2', coupled with TF-IDF relevance scores, might be a better approach, since such a test also evaluates the importance of a feature in relation to the respective target class (see the sketch below).
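As a rough illustration of that idea (my own sketch, not part of the original answer; the documents and labels are made up, and it assumes a recent scikit-learn with get_feature_names_out), TF-IDF features can be scored against the target class with chi2 before selecting the top k:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    # hypothetical toy corpus and target classes
    docs = ["the bike engine roared past",
            "new bike helmet review",
            "stock prices fell sharply today",
            "markets rallied on strong earnings"]
    labels = ["motorcycle", "motorcycle", "finance", "finance"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)               # TF-IDF relevance scores per (doc, term)

    # chi2 scores each feature against the target class, not just by its raw TF-IDF weight
    selector = SelectKBest(chi2, k=4).fit(X, labels)
    selected = vectorizer.get_feature_names_out()[selector.get_support()]
    print(selected)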

And of course, combining such interpretation with the model's learned feature weights would be very helpful in fully understanding the importance of text-derived features.

The problem is a little too complex to cover in detail here, but hopefully the above helps. What do others think?

Reference: https://arxiv.org/abs/1707.05261


There are two contexts in which saliency can be calculated:

  1. saliency in the corpus
  2. saliency in a single document

Saliency in the corpus can simply be calculated by counting the appearances of a particular word in the corpus, or by the inverse of the count of documents the word appears in (IDF = Inverse Document Frequency). This works because words that carry specific meaning do not appear everywhere.

Saliency in a document is calculated by TF-IDF, because it combines two kinds of information: global (corpus-based) and local (document-based). Claiming that "the word with the larger in-document frequency is more important in the current document" is neither completely true nor completely false, because it depends on the global saliency of the word. In a particular document you have many words like "it, is, am, are, ..." with large frequencies, but these words are not important in any document and you can treat them as stop words!
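To make that concrete, here is a small self-contained sketch (mine, with made-up tokenized documents) of local term frequency being weighted by global IDF, so that a word like "it" that appears everywhere gets zero document-level saliency:

    import math
    from collections import Counter

    # hypothetical tokenized documents
    docs = [["it", "is", "graph", "trees", "graph"],
            ["it", "is", "graph", "survey"],
            ["it", "is", "trees", "minors"]]

    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))     # corpus-level: document frequency

    def tf_idf(word, doc):
        tf = doc.count(word) / len(doc)                   # local, document-level information
        idf = math.log(n_docs / df[word])                 # global, corpus-level information
        return tf * idf

    print(tf_idf("graph", docs[0]))   # > 0: frequent here and not present everywhere
    print(tf_idf("it", docs[0]))      # 0.0: just as frequent, but appears in every document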

---- edit ---

The denominator (= len(corpus_tfidf)) is a constant value and can be ignored if you care about the ordering rather than the magnitude of the measurement. On the other hand, we know that IDF means Inverse Document Frequency, so IDF can be represented by 1/DF. We also know that DF is a corpus-level value while TF is a document-level value, so summing TF-IDF turns the document-level TF into a corpus-level TF. Indeed, the summation is equal to this formula:

    count(word) / count(documents containing the word)

This measurement can be called an inverse-scattering value. When the value goes up, it means the word is concentrated in a smaller subset of documents, and vice versa.
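A tiny numeric check of that identity, under the simplifying assumptions used above (raw counts for TF, IDF = 1/DF, no normalization); the toy corpus is made up:

    corpus = [["graph", "trees", "minors"],
              ["graph", "graph", "survey"],
              ["trees", "paths"]]

    word = "graph"
    df = sum(1 for doc in corpus if word in doc)                       # documents containing the word
    tfidf_sum = sum(doc.count(word) * (1.0 / df) for doc in corpus)    # sum over docs of tf * (1/df)

    total_count = sum(doc.count(word) for doc in corpus)
    print(tfidf_sum, total_count / df)    # both equal count(word) / count(documents containing word)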

I believe that this formula is not so useful.