
How to compute jaccard similarity from a pandas dataframe


Short and vectorized (fast) answer:

Use 'hamming' as the metric in scikit-learn's pairwise_distances:

from sklearn.metrics.pairwise import pairwise_distances

# pairwise_distances works on rows, so pass df.T to compare columns
jac_sim = 1 - pairwise_distances(df.T, metric="hamming")

# optionally convert it to a DataFrame
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)

Explanation:

Assume this is your dataset:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))
print(df.head())

   A  B  C  D  E
0  1  1  1  1  0
1  1  0  1  1  0
2  1  1  1  1  0
3  0  0  1  1  1
4  1  1  0  1  0

Using sklearn's jaccard_similarity_score (since removed from scikit-learn; the newer jaccard_score computes a different quantity, see below), the similarity between column A and B is:

from sklearn.metrics import jaccard_similarity_score
print(jaccard_similarity_score(df['A'], df['B']))
0.43

This is the number of rows where the two columns have the same value, divided by the total number of rows, 100.
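
You can check that interpretation directly with plain pandas; a quick sketch on the same df:

# fraction of the 100 rows in which columns A and B agree
print((df['A'] == df['B']).mean())
0.43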

As far as I know, there is no pairwise version of jaccard_similarity_score, but there are pairwise versions of distances.

However, SciPy defines Jaccard distance as follows:

Given two vectors, u and v, the Jaccard distance is the proportion of those elements u[i] and v[i] that disagree where at least one of them is non-zero.

So it excludes the rows where both columns have 0 values (the newer jaccard_score follows this definition as well: intersection of the 1s over their union). jaccard_similarity_score did not exclude them. Hamming distance, on the other hand, is in line with the similarity definition used above:

The proportion of those vector elements between two n-vectors u and v which disagree.
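
To see the two definitions side by side on one pair of columns, here is a small sketch using SciPy's distance functions directly (the Jaccard value is not shown because it depends on how many 0/0 rows the data happens to contain):

from scipy.spatial.distance import hamming, jaccard

# hamming counts all 100 rows, matching the similarity used above
print(1 - hamming(df['A'], df['B']))   # 0.43
# jaccard ignores the rows where both columns are 0, so it generally differs
print(1 - jaccard(df['A'], df['B']))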

So if you want to calculate that matching-based similarity for every pair of columns, you can use 1 - hamming:

from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T, metric="hamming"))

array([[ 1.  ,  0.43,  0.61,  0.55,  0.46],
       [ 0.43,  1.  ,  0.52,  0.56,  0.49],
       [ 0.61,  0.52,  1.  ,  0.48,  0.53],
       [ 0.55,  0.56,  0.48,  1.  ,  0.49],
       [ 0.46,  0.49,  0.53,  0.49,  1.  ]])

In a DataFrame format:

jac_sim = 1 - pairwise_distances(df.T, metric="hamming")
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)
# jac_sim = np.triu(jac_sim) to zero out the entries below the diagonal
# jac_sim = np.tril(jac_sim) to zero out the entries above the diagonal

      A     B     C     D     E
A  1.00  0.43  0.61  0.55  0.46
B  0.43  1.00  0.52  0.56  0.49
C  0.61  0.52  1.00  0.48  0.53
D  0.55  0.56  0.48  1.00  0.49
E  0.46  0.49  0.53  0.49  1.00
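
If you do want the stricter SciPy-style Jaccard similarity instead (ignoring the rows where both columns are 0), pairwise_distances also accepts metric='jaccard'; a sketch along the same lines (the resulting values will differ from the matrix above):

# intersection of 1s over union of 1s, for every pair of columns
strict_jac = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
strict_jac = pd.DataFrame(strict_jac, index=df.columns, columns=df.columns)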

You can also reproduce the 1 - hamming matrix by iterating over combinations of columns, but it will be much slower.

import itertools

sim_df = pd.DataFrame(np.ones((5, 5)), index=df.columns, columns=df.columns)
for col_pair in itertools.combinations(df.columns, 2):
    sim_df.loc[col_pair] = sim_df.loc[tuple(reversed(col_pair))] = jaccard_similarity_score(df[col_pair[0]], df[col_pair[1]])
print(sim_df)

      A     B     C     D     E
A  1.00  0.43  0.61  0.55  0.46
B  0.43  1.00  0.52  0.56  0.49
C  0.61  0.52  1.00  0.48  0.53
D  0.55  0.56  0.48  1.00  0.49
E  0.46  0.49  0.53  0.49  1.00
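
If you would rather not depend on scikit-learn at all, the same matching-proportion matrix can be built with plain NumPy broadcasting; a sketch (fine for a handful of columns, memory-hungry for very wide frames):

vals = df.values  # shape (100, 5)
# compare every column with every other column, then average over the rows
match = (vals[:, :, None] == vals[:, None, :]).mean(axis=0)
match_df = pd.DataFrame(match, index=df.columns, columns=df.columns)
print(match_df)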