String similarity metrics in Python


I realize it's not the same thing, but this is close enough:

    >>> import difflib
    >>> a = 'Hello, All you people'
    >>> b = 'hello, all You peopl'
    >>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
    >>> seq.ratio()
    0.97560975609756095

You can wrap this in a function that lowercases both strings and treats a ratio above 0.9 as a match:

    def similar(seq1, seq2):
        return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

    >>> similar(a, b)
    True
    >>> similar('Hello, world', 'Hi, world')
    False
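For anyone wondering what that number means: the difflib docs define ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. Here is a minimal sketch that recomputes it from get_matching_blocks(). The helper name manual_ratio is my own and not part of difflib.

```python
import difflib

def manual_ratio(s1, s2):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(a=s1, b=s2)
    # Each block is (a_index, b_index, size); the final dummy block has size 0.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(s1) + len(s2)
    # Two empty strings are treated as identical, as difflib does.
    return 2.0 * matches / total if total else 1.0

a = 'Hello, All you people'.lower()
b = 'hello, all You peopl'.lower()
print(manual_ratio(a, b))  # same value as SequenceMatcher(a=a, b=b).ratio()
```

For the two strings above, 20 characters match out of 41 in total, which gives 40/41, roughly 0.9756.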


There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunsch distance or Sellers Algorithm
  • and many more...
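To give an idea of how little code the simpler ones need, here is a rough pure-Python sketch of the first two metrics in the list. This is my own textbook-style version, not a port of the Sheffield implementations, and the function names are just illustrative.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def levenshtein(s1, s2):
    """Edit distance (insert, delete, substitute) using two DP rows."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into an empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

For long strings or large batches a C-backed library will be much faster; this is only meant to show the shape of the algorithms.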


This snippet calculates the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. Here I was iterating over a TSV file in which the strings of interest occupied columns [3] and [4] of each row. It needs two third-party packages (pip install python-Levenshtein and pip install distance):

    import codecs, difflib, Levenshtein, distance

    with codecs.open("titles.tsv", "r", "utf-8") as f:
        # Drop the empty string left after the trailing newline.
        title_list = f.read().split("\n")[:-1]
        for row in title_list:
            sr    = row.lower().split("\t")
            diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
            lev   = Levenshtein.ratio(sr[3], sr[4])
            # distance returns dissimilarities, so subtract from 1 for similarity.
            sor   = 1 - distance.sorensen(sr[3], sr[4])
            jac   = 1 - distance.jaccard(sr[3], sr[4])
            print(diffl, lev, sor, jac)
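If you'd rather not install the distance package, the last two measures are easy to approximate with the standard library. The sketch below assumes both metrics are computed over the sets of characters in each string, which is how I understand distance handles plain strings; treat that as an assumption, and note that the function names are mine.

```python
def jaccard_similarity(s1, s2):
    """|A & B| / |A | B| over the sets of characters in each string."""
    set1, set2 = set(s1), set(s2)
    union = set1 | set2
    # Two empty strings are treated as identical.
    return len(set1 & set2) / len(union) if union else 1.0

def sorensen_similarity(s1, s2):
    """2 * |A & B| / (|A| + |B|) over the sets of characters in each string."""
    set1, set2 = set(s1), set(s2)
    total = len(set1) + len(set2)
    return 2 * len(set1 & set2) / total if total else 1.0

print(jaccard_similarity("night", "nacht"))   # 3 shared out of 7 distinct characters
print(sorensen_similarity("night", "nacht"))  # 2 * 3 / (5 + 5)
```

Because these work on character sets, they ignore order and repetition, so anagrams score as identical. Splitting into words or n-grams first is a common way around that.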