What's the fastest way in Python to calculate cosine similarity given sparse matrix data?


You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
import numpy as np

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
 [[ 1.          0.40824829  0.40824829]
  [ 0.40824829  1.          0.33333333]
  [ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
 (0, 1)  0.408248290464
 (0, 2)  0.408248290464
 (0, 0)  1.0
 (1, 0)  0.408248290464
 (1, 2)  0.333333333333
 (1, 1)  1.0
 (2, 1)  0.333333333333
 (2, 0)  0.408248290464
 (2, 2)  1.0

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

A_sparse.transpose()
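For example, a brief self-contained sketch (reusing the same toy matrix from above) of getting column-wise similarities this way:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# transposing first gives similarities between columns instead of rows
col_similarities = cosine_similarity(A_sparse.transpose())
print(col_similarities.shape)  # (5, 5): one entry per pair of columns
```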


The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM).

See below for a discussion of how to optimize for sparsity.

import numpy

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = numpy.dot(A, A.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = numpy.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[numpy.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = numpy.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
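As a sketch of that substitution, assuming `A` is a scipy.sparse matrix with items in rows, the whole recipe becomes:

```python
import numpy as np
from scipy import sparse

# toy item-by-feature matrix; in practice A would be large and sparse
A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]]))

# sparse dot products, then densify the (small) item-by-item result
similarity = A.dot(A.T).toarray()

# squared magnitudes of the item vectors
square_mag = np.diag(similarity)

# invert magnitudes, mapping empty items (inf) to zero
with np.errstate(divide='ignore'):
    inv_square_mag = 1 / square_mag
inv_square_mag[np.isinf(inv_square_mag)] = 0
inv_mag = np.sqrt(inv_square_mag)

# scale rows and columns by the inverse magnitudes
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag
```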


I have tried some of the methods above. However, the experiment by @zbinsd has a limitation: the sparsity of the matrix used in that experiment is extremely low, while real-world sparsity is usually over 90%. In my case, the sparse matrix has shape (7000, 25000) and a sparsity of 97%. Method 4 is extremely slow and I couldn't wait for its results; method 6 finishes in 10 s. Amazingly, the method below finishes in only 0.247 s:

import sklearn.preprocessing as pp

def cosine_similarities(mat):
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat
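As a quick sanity check, this function (applied to a hypothetical random matrix) agrees with sklearn's `cosine_similarity` run on the transposed input, since normalizing each column to unit L2 norm and taking column dot products is exactly column-wise cosine similarity:

```python
import numpy as np
from scipy import sparse
import sklearn.preprocessing as pp
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarities(mat):
    # normalize each column to unit L2 norm, then take all column dot products
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

A = sparse.csr_matrix(np.random.default_rng(0).random((20, 10)))
fast = cosine_similarities(A).toarray()
ref = cosine_similarity(A.T)  # column similarities of A = row similarities of A.T
print(np.allclose(fast, ref))  # True
```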

This efficient method comes from an externally linked reference in the original answer.