High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

python string-matching levenshtein-distance difflib

In case you're interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles:

    import difflib

    import distance
    import Levenshtein

    # titles.tsv is tab-separated; the code reads the two titles to compare
    # from the columns at index 3 and 4 of each row.
    with open("titles.tsv", encoding="utf-8") as f:
        title_list = f.read().split("\n")[:-1]
        for row in title_list:
            sr = row.lower().split("\t")
            diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev = Levenshtein.ratio(sr[3], sr[4])
            sor = 1 - distance.sorensen(sr[3], sr[4])
            jac = 1 - distance.jaccard(sr[3], sr[4])
            # Space-separated output, one line per title pair
            print(diffl, lev, sor, jac)

I then plotted the results with R:

[Figure: plot of the Difflib and Levenshtein similarity values]

Strictly for the curious, I also compared the Difflib, Levenshtein, Sørensen, and Jaccard similarity values:

    library(ggplot2)
    require(GGally)

    difflib <- read.table("similarity_measures.txt", sep = " ")
    colnames(difflib) <- c("difflib", "levenshtein", "sorensen", "jaccard")
    ggpairs(difflib)

Result: [Figure: ggpairs plot matrix of the difflib, levenshtein, sorensen, and jaccard values]

The Difflib / Levenshtein similarity really is quite interesting.

2018 edit: If you're working on identifying similar strings, you could also check out minhashing--there's a great overview here. Minhashing is amazing at finding similarities in large text collections in linear time. My lab put together an app that detects and visualizes text reuse using minhashing here: https://github.com/YaleDHLab/intertext
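To give a feel for the idea, here is a minimal minhash sketch using only the standard library. It is a generic illustration, not the method used by intertext; the function names, the 3-character shingles, and the choice of 64 salted BLAKE2b hashes are all assumptions made for this example.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of overlapping character k-grams of `text` (hypothetical helper)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the smallest hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature(shingles("The Adventures of Tom Sawyer"))
sig_b = minhash_signature(shingles("Adventures of Tom Sawyer, The"))
print(estimated_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of string length, so comparing two documents costs the same no matter how long they are. Scaling to large collections usually adds a bucketing step on top of the signatures, which this sketch leaves out.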


difflib.SequenceMatcher uses the Ratcliff/Obershelp algorithm: it computes twice the number of matching characters divided by the total number of characters in the two strings.
Levenshtein uses the Levenshtein algorithm: it computes the minimum number of edits needed to transform one string into the other.
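The difflib ratio described above can be recomputed by hand from the matching blocks that SequenceMatcher reports. This is a small sketch for illustration; `sequence_matcher_ratio` is a hypothetical helper name, not part of difflib.

```python
import difflib

def sequence_matcher_ratio(a, b):
    """Recompute SequenceMatcher.ratio() by hand as 2*M / T."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the last block is a zero-length sentinel).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings.
    total = len(a) + len(b)
    # Two empty strings count as identical, as in difflib.
    return 2.0 * matches / total if total else 1.0

a, b = "abcd", "bcde"
print(sequence_matcher_ratio(a, b))                  # 0.75
print(difflib.SequenceMatcher(None, a, b).ratio())   # 0.75
```

Here the only matching block is "bcd" (3 characters) and the two strings have 8 characters in total, so the ratio is 2 * 3 / 8 = 0.75.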

Complexity

SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common. (from the difflib documentation)

Levenshtein is O(m*n), where n and m are the lengths of the two input strings.
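The O(m*n) bound comes from the textbook dynamic-programming formulation, which fills one table cell per pair of character positions. The sketch below is a plain-Python illustration of that formulation with unit costs; it is not the C implementation in the Levenshtein module, and `levenshtein_distance` is a name chosen for this example.

```python
def levenshtein_distance(s, t):
    """Edit distance with unit-cost insert, delete, and substitute."""
    # Only two rows of the (len(s)+1) x (len(t)+1) table are kept in memory.
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("kitten", "sitting"))  # 3
```

The two nested loops visit each of the m*n cells exactly once, which is where the quadratic cost comes from.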

Performance

According to the source code of the Levenshtein module: "Levenshtein has some overlap with difflib (SequenceMatcher). It supports only strings, not arbitrary sequence types, but on the other hand it's much faster."
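If you want to check the speed claim on your own data, a rough timing harness along these lines may help. It assumes the third-party Levenshtein package is installed and falls back to timing difflib alone if it is not; `time_similarity` and the sample strings are made up for this example, and actual numbers will vary with string length and machine.

```python
import difflib
import timeit

try:
    import Levenshtein  # third-party package, assumed to be installed separately
except ImportError:
    Levenshtein = None

def time_similarity(a, b, repeats=1000):
    """Return the seconds each available implementation takes for `repeats` calls."""
    timings = {
        "difflib": timeit.timeit(
            lambda: difflib.SequenceMatcher(None, a, b).ratio(), number=repeats
        )
    }
    if Levenshtein is not None:
        timings["Levenshtein"] = timeit.timeit(
            lambda: Levenshtein.ratio(a, b), number=repeats
        )
    return timings

for name, seconds in time_similarity(
    "high performance fuzzy string comparison",
    "high-performance fuzzy string matching",
).items():
    print(name, seconds)
```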
