Python: tf-idf-cosine: to find document similarity


First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it all in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 1787553 stored elements in Compressed Sparse Row format>

Now, to find the cosine similarities between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, as the tf-idf vectors are already row-normalized.

As explained by Chris Clark in the comments and here, cosine similarity does not take into account the magnitude of the vectors. Row-normalised vectors have a magnitude of 1, so the linear kernel is sufficient to calculate the similarity values.
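As a quick sanity check (a minimal sketch, reusing the tfidf matrix from above; the norm should print 1.0, up to floating-point rounding), you can verify that each row has unit L2 norm, which is why a plain dot product already gives the cosine:

>>> import numpy as np
>>> np.linalg.norm(tfidf[0:1].toarray())  # each TfidfVectorizer row is L2-normalized
1.0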

The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 89 stored elements in Compressed Sparse Row format>

scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product that is also known as the linear kernel:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
        0.04457106,  0.03293218])
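Equivalently (a minimal sketch), cosine_similarity from the same module should give identical values here, since the rows are already unit-length and linear_kernel merely skips the redundant re-normalization:

>>> import numpy as np
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> np.allclose(cosine_similarity(tfidf[0:1], tfidf).flatten(), cosine_similarities)
True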

Hence, to find the top related documents, we can use argsort and some negative array slicing: the most related documents have the highest cosine similarity values and hence sit at the end of the sorted indices array, so the [:-5:-1] slice takes the last four entries in reverse order (the query itself plus its top 3 matches):

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
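If you want to exclude the query document itself from the results (a minimal variant of the slicing above), filter out index 0 after sorting; given the scores above, this should yield:

>>> indices = cosine_similarities.argsort()[::-1]  # all indices, best match first
>>> [i for i in indices if i != 0][:3]
[958, 10576, 3277]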

The first result is a sanity check: we find the query document itself as the most similar document, with a cosine similarity score of 1. It has the following text:

>>> print twenty.data[0]
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message and hence has many words in common:

>>> print twenty.data[958]
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26

In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my
thing) writes:
>
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In
addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.

>    ---- brought to you by your neighborhood Lerxst ----

Rush fan?

--
Robert Seymour              rseymour@reed.edu
Physics and Philosophy, Reed College    (NeXTmail accepted)
Artificial Life Project         Reed College
Reed Solar Energy Project (SolTrain)    Portland, OR


With the help of @excray's comment, I managed to figure out the answer. What we need to do is write a simple for loop to iterate over the two arrays that represent the train data and the test data.

First, implement a simple lambda function to hold the formula for the cosine calculation:

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
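To see what this computes, here is a tiny worked example on two of the count vectors that appear in the output further down: cos(a, b) = a·b / (|a||b|) = 1 / (√2·√3) ≈ 0.408.

>>> import numpy as np
>>> import numpy.linalg as LA
>>> a = np.array([1, 0, 1, 0])   # "The sky is blue."  (blue, bright, sky, sun)
>>> b = np.array([0, 1, 1, 1])   # "The sun in the sky is bright."
>>> round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)
0.408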

And then just write a simple for loop to iterate over the two arrays; the logic is: "For each vector in trainVectorizerArray, you have to find the cosine similarity with the vector in testVectorizerArray."

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]            # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
transformer = TfidfTransformer()

# Raw term-count vectors for the documents and the query
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
print 'Transform Vectorizer to test set', testVectorizerArray

# Cosine similarity: dot product divided by the product of the norms
cx = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print vector
    for testV in testVectorizerArray:
        print testV
        cosine = cx(vector, testV)
        print cosine

# TF-IDF-weighted (and L2-normalized) versions of the same vectors
transformer.fit(trainVectorizerArray)
print
print transformer.transform(trainVectorizerArray).toarray()

transformer.fit(testVectorizerArray)
print
tfidf = transformer.transform(testVectorizerArray)
print tfidf.todense()

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]
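For comparison, here is a minimal sketch of the same pairwise computation without the explicit loop, using sklearn's cosine_similarity directly on the sparse count matrices (assuming the same train_set and test_set as above; the expected values match the 0.408 and 0.816 printed by the loop):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]            # Query

vectorizer = CountVectorizer(stop_words='english')
train_counts = vectorizer.fit_transform(train_set)  # 2x4 sparse count matrix
test_counts = vectorizer.transform(test_set)        # 1x4 sparse count matrix

# One call computes the cosine of every (train, test) pair at once.
print(cosine_similarity(train_counts, test_counts))
# Expected, up to rounding: [[ 0.40824829], [ 0.81649658]]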


I know it's an old post, but I tried the scikit-learn package (http://scikit-learn.sourceforge.net/stable/). The question was how to calculate the cosine similarity with this package, and here is my code for that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

f = open("/root/Myfolder/scoringDocuments/doc1")
doc1 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc2")
doc2 = str.decode(f.read(), "UTF-8", "ignore")
f = open("/root/Myfolder/scoringDocuments/doc3")
doc3 = str.decode(f.read(), "UTF-8", "ignore")

train_set = ["president of India", doc1, doc2, doc3]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  # finds the tfidf score with normalization
print "cosine scores ==> ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train)  # here the first element of tfidf_matrix_train is matched with the other three elements

Here, suppose the query is the first element of train_set, and doc1, doc2 and doc3 are the documents which I want to rank with the help of cosine similarity; then I can use this code.

Also, the tutorials provided in the question were very useful. Here are all the parts: part-I, part-II, part-III.

The output will be as follows:

[[ 1.          0.07102631  0.02731343  0.06348799]]

Here, 1 represents that the query is matched with itself, and the other three are the scores for matching the query with the respective documents.
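To turn those scores into an actual ranking of doc1, doc2 and doc3 (a minimal sketch, reusing the scores printed above):

import numpy as np

scores = np.array([1.0, 0.07102631, 0.02731343, 0.06348799])
# Skip index 0 (the query matched against itself) and sort the rest descending.
ranking = np.argsort(scores[1:])[::-1] + 1
print(ranking)  # [1 3 2] -> doc1, then doc3, then doc2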