String similarity metrics in Python
I realize it's not the same thing, but this is close enough:
    >>> import difflib
    >>> a = 'Hello, All you people'
    >>> b = 'hello, all You peopl'
    >>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
    >>> seq.ratio()
    0.97560975609756095
You can wrap this in a function:
    def similar(seq1, seq2):
        return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

    >>> similar(a, b)
    True
    >>> similar('Hello, world', 'Hi, world')
    False
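If it helps to see where that number comes from: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. Here is a small sketch that recomputes it by hand for the example strings above (the variable names `matched` and `manual_ratio` are mine, purely for illustration):

```python
import difflib

a = "hello, all you people"
b = "hello, all you peopl"

matcher = difflib.SequenceMatcher(a=a, b=b)

# Sum the sizes of all matching blocks; the final block is a
# zero-length sentinel, so it does not affect the total.
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio() is documented as 2.0 * M / T, with T the combined length.
manual_ratio = 2.0 * matched / (len(a) + len(b))

print(matched, manual_ratio, matcher.ratio())
```

Because the score is normalised by the total length, the 0.9 cutoff in similar() is a judgment call; short strings lose a lot of score for a single differing character.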
There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here's a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- and many more...
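To give an idea of how little code the first two metrics on that list need, here is a minimal pure-Python sketch. It is not taken from the Sheffield implementations; the function names `hamming_distance` and `levenshtein_distance` are my own, and it favours readability over speed.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein_distance(s1, s2):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s1 into s2."""
    # previous[j] is the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(hamming_distance("karolin", "kathrin"))
print(levenshtein_distance("kitten", "sitting"))
```

Both return a distance (0 means identical), not a similarity in the 0 to 1 range, so they would need normalising, for example by the longer string's length, before being compared with something like difflib's ratio.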
This snippet will calculate the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. In the snippet below, I was iterating over a tsv in which the strings of interest occupied columns [3] and [4] of the tsv. (pip install python-Levenshtein and pip install distance):
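    import codecs, difflib, Levenshtein, distance

    with codecs.open("titles.tsv", "r", "utf-8") as f:
        # Drop the empty string left after the file's trailing newline.
        title_list = f.read().split("\n")[:-1]

        for row in title_list:
            sr = row.lower().split("\t")

            # Each value is a similarity: 1.0 means the strings are identical.
            diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev = Levenshtein.ratio(sr[3], sr[4])
            # distance returns distances, so subtract from 1 for similarity.
            sor = 1 - distance.sorensen(sr[3], sr[4])
            jac = 1 - distance.jaccard(sr[3], sr[4])

            print(diffl, lev, sor, jac)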
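If you would rather not install anything, the last two values are easy to approximate with the standard library. This sketch assumes the Sørensen and Jaccard measures are computed over the set of characters in each string, which is my understanding of what the distance package does but is not something I have verified against its source. The function names are my own.

```python
def jaccard_similarity(s1, s2):
    """Size of the shared character set divided by the size of the union."""
    set1, set2 = set(s1), set(s2)
    if not set1 and not set2:
        return 1.0  # assumption: treat two empty strings as identical
    return len(set1 & set2) / float(len(set1 | set2))


def sorensen_similarity(s1, s2):
    """Twice the shared character count divided by the sum of both set sizes."""
    set1, set2 = set(s1), set(s2)
    if not set1 and not set2:
        return 1.0  # assumption: treat two empty strings as identical
    return 2.0 * len(set1 & set2) / (len(set1) + len(set2))


print(jaccard_similarity("night", "nacht"))
print(sorensen_similarity("night", "nacht"))
```

Note that set-based measures ignore character order and repetition, so 'abc' and 'cba' come out as identical. That is one reason to look at them alongside difflib and Levenshtein rather than on their own.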