Jaccard Similarity for Texts in a pandas DataFrame

python pandas scikit-learn similarity sklearn-pandas

One way to speed up the process could be parallel processing using Pandas on Ray (the project now known as Modin).

You can try NLTK's implementation of jaccard_distance to compute the Jaccard similarity. I couldn't find any significant improvement in processing time for the similarity calculation, though it may work out better on a larger dataset.
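For reference, here is a minimal pure-Python sketch of pairwise Jaccard similarity on token sets. The question's own custom function is not reproduced on this page, so the names jaccard_similarity and tokenize, the whitespace tokenization, and the sample texts are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(text):
    # Assumption: lowercase + whitespace split; swap in your own tokenizer.
    return set(text.lower().split())


def jaccard_similarity(a, b):
    # Jaccard similarity of two sets: |A & B| / |A | B|.
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty texts count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical sample texts standing in for a DataFrame column.
texts = [
    "the quick brown fox",
    "the quick red fox",
    "a lazy dog sleeps",
]
token_sets = [tokenize(t) for t in texts]

# Similarity for every unordered pair of rows, keyed by (row_i, row_j).
pairwise = {
    (i, j): jaccard_similarity(token_sets[i], token_sets[j])
    for i, j in combinations(range(len(texts)), 2)
}
```

Tokenizing each text once, outside the pairwise loop, avoids rebuilding the sets for every comparison. NLTK's jaccard_distance takes two sets and returns a distance, so the corresponding similarity is 1 minus its result.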

I compared the NLTK implementation with your custom Jaccard similarity function on 200 text samples with an average length of 4 words (tokens).

NLTK jaccard_distance:

CPU times: user 3.3 s, sys: 30.3 ms, total: 3.34 s
Wall time: 3.38 s

Custom Jaccard similarity implementation:

CPU times: user 3.67 s, sys: 19.2 ms, total: 3.69 s
Wall time: 3.71 s
