
How to compute jaccard similarity from a pandas dataframe


Short and vectorized (fast) answer:

Use 'hamming' as the metric in scikit-learn's pairwise_distances:

from sklearn.metrics.pairwise import pairwise_distances

# pairwise_distances works on rows, so pass df.T to compare columns
jac_sim = 1 - pairwise_distances(df.T, metric="hamming")

# optionally convert it to a DataFrame
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)

Explanation:

Assume this is your dataset:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))
print(df.head())

   A  B  C  D  E
0  1  1  1  1  0
1  1  0  1  1  0
2  1  1  1  1  0
3  0  0  1  1  1
4  1  1  0  1  0

Using sklearn's jaccard_similarity_score (since removed from scikit-learn; the newer jaccard_score computes a different quantity, see below), the similarity between column A and B is:

from sklearn.metrics import jaccard_similarity_score
print(jaccard_similarity_score(df['A'], df['B']))
0.43

This is the number of rows where the two columns have the same value, divided by the total number of rows, 100.
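
You can check that interpretation directly with plain pandas; a quick sketch on the same df:

# fraction of the 100 rows in which columns A and B agree
print((df['A'] == df['B']).mean())
0.43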

As far as I know, there is no pairwise version of jaccard_similarity_score, but there are pairwise versions of distances.

However, SciPy defines Jaccard distance as follows:

Given two vectors, u and v, the Jaccard distance is the proportion of those elements u[i] and v[i] that disagree where at least one of them is non-zero.

So it excludes the rows where both columns have 0 values (the newer jaccard_score follows this definition as well: intersection of the 1s over their union). jaccard_similarity_score did not exclude them. Hamming distance, on the other hand, is in line with the similarity definition used above:

The proportion of those vector elements between two n-vectors u and v which disagree.
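
To see the two definitions side by side on one pair of columns, here is a small sketch using SciPy's distance functions directly (the Jaccard value is not shown because it depends on how many 0/0 rows the data happens to contain):

from scipy.spatial.distance import hamming, jaccard

# hamming counts all 100 rows, matching the similarity used above
print(1 - hamming(df['A'], df['B']))   # 0.43
# jaccard ignores the rows where both columns are 0, so it generally differs
print(1 - jaccard(df['A'], df['B']))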

So if you want to calculate that matching-based similarity for every pair of columns, you can use 1 - hamming:

from sklearn.metrics.pairwise import pairwise_distances
print(1 - pairwise_distances(df.T, metric="hamming"))

array([[ 1.  ,  0.43,  0.61,  0.55,  0.46],
       [ 0.43,  1.  ,  0.52,  0.56,  0.49],
       [ 0.61,  0.52,  1.  ,  0.48,  0.53],
       [ 0.55,  0.56,  0.48,  1.  ,  0.49],
       [ 0.46,  0.49,  0.53,  0.49,  1.  ]])

In a DataFrame format:

jac_sim = 1 - pairwise_distances(df.T, metric="hamming")
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)
# jac_sim = np.triu(jac_sim) to zero out the entries below the diagonal
# jac_sim = np.tril(jac_sim) to zero out the entries above the diagonal

      A     B     C     D     E
A  1.00  0.43  0.61  0.55  0.46
B  0.43  1.00  0.52  0.56  0.49
C  0.61  0.52  1.00  0.48  0.53
D  0.55  0.56  0.48  1.00  0.49
E  0.46  0.49  0.53  0.49  1.00
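
If you do want the stricter SciPy-style Jaccard similarity instead (ignoring the rows where both columns are 0), pairwise_distances also accepts metric='jaccard'; a sketch along the same lines (the resulting values will differ from the matrix above):

# intersection of 1s over union of 1s, for every pair of columns
strict_jac = 1 - pairwise_distances(df.T.values.astype(bool), metric="jaccard")
strict_jac = pd.DataFrame(strict_jac, index=df.columns, columns=df.columns)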

You can also reproduce the 1 - hamming matrix by iterating over combinations of columns, but it will be much slower.

import itertools

sim_df = pd.DataFrame(np.ones((5, 5)), index=df.columns, columns=df.columns)
for col_pair in itertools.combinations(df.columns, 2):
    sim_df.loc[col_pair] = sim_df.loc[tuple(reversed(col_pair))] = jaccard_similarity_score(df[col_pair[0]], df[col_pair[1]])
print(sim_df)

      A     B     C     D     E
A  1.00  0.43  0.61  0.55  0.46
B  0.43  1.00  0.52  0.56  0.49
C  0.61  0.52  1.00  0.48  0.53
D  0.55  0.56  0.48  1.00  0.49
E  0.46  0.49  0.53  0.49  1.00
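
If you would rather not depend on scikit-learn at all, the same matching-proportion matrix can be built with plain NumPy broadcasting; a sketch (fine for a handful of columns, memory-hungry for very wide frames):

vals = df.values  # shape (100, 5)
# compare every column with every other column, then average over the rows
match = (vals[:, :, None] == vals[:, None, :]).mean(axis=0)
match_df = pd.DataFrame(match, index=df.columns, columns=df.columns)
print(match_df)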