How to get tfidf with pandas dataframe?


The scikit-learn implementation is straightforward:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify; see the TfidfVectorizer documentation.

The output of fit_transform is a sparse matrix. If you want to inspect it, you can convert it to a dense array with x.toarray():

In [44]: x.toarray()
Out[44]:
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])


A simple solution is to use texthero:

import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])

In [5]: df.head()
Out[5]:
  docId                         sent                                              tfidf
0      1   This is the first sentence  [0.3816141458138271, 0.6461289150464732, 0.381...
1      2  This is the second sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
2      3   This is the third sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
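If you would rather not add a dependency, a similar single column of weight lists can be built with scikit-learn alone. This is only a sketch on assumed sample data; the values follow scikit-learn's alphabetical vocabulary order, which may differ from the order texthero uses.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample frame assumed from the output shown above.
df = pd.DataFrame({
    "docId": [1, 2, 3],
    "sent": [
        "This is the first sentence",
        "This is the second sentence",
        "This is the third sentence",
    ],
})

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df["sent"])

# Store each document's weight vector as a plain Python list in one column.
df["tfidf"] = [row.tolist() for row in matrix.toarray()]
print(df.head())
```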


I found a slightly different method using CountVectorizer from sklearn. References I used:
-- count vectorizer: Ultraviolet Analysis, word frequency
-- preprocessing/cleaning text: Usman Malik, scraping tweets preprocessing

I won't be covering preprocessing in this answer. Basically, you import CountVectorizer and fit it to your data. After fitting, .vocabulary_.items() gives you the vocabulary of your dataset: the unique terms present, each mapped to its column index in the count matrix (the frequencies themselves come from the transformed matrix), subject to any limiting parameters you pass into CountVectorizer, such as max_features.

Then you use TfidfTransformer to generate tf-idf weights for the terms in a similar manner.
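As a quick sanity check, the two-step route (CountVectorizer followed by TfidfTransformer) should give the same weights as TfidfVectorizer when both use default settings. The sketch below assumes a few made-up sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical sample documents.
docs = [
    "This is the first sentence",
    "This is the second sentence",
    "This is the third sentence",
]

# Two steps: raw counts first, then tf-idf weighting of those counts.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both internally.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step version is mainly useful when you also want the raw counts, as in the snippet that follows.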

I am coding in a Jupyter notebook file using pandas and the PyCharm IDE. The #%% lines in the code are cell separators.

Here is a code snippet:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features=5000, stop_words='english', min_df=.01, max_df=.90)

#%%
# use CountVectorizer.fit(raw_documents) to learn the vocabulary dictionary of all tokens in the raw documents
# raw documents in this case will be tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
# useful debug: get an idea of the item list you generated
list(countVec.vocabulary_.items())

#%%
# convert to bag of words; transform returns a sparse document-term count matrix
countVec_count = countVec.transform(tweetsFrameWords["Text"])

#%%
# make a list from the number of occurrences of each term
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()
# make a new data frame with columns term and occurrences, meaning word and number of occurrences
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
print(bowListFrame)
# sort in order of number of word occurrences, most -> least
# if you leave off the ascending flag, it defaults to ascending
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)

#%%
# now, convert to a more useful ranking system, tf-idf weights
# TfidfTransformer: scale raw word counts to a weighted ranking
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()
# initial fit representation using transformer object
tweetWeights = tweetTransformer.fit_transform(countVec_count)
# follow a similar process to making the data frame of word occurrences, but with term weights
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()
# now that we've done tf-idf, make a dataframe with weights and names
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)