String similarity metrics in Python

python algorithm string levenshtein-distance

I realize it's not the same thing, but this is close enough:

>>> import difflib
>>> a = 'Hello, All you people'
>>> b = 'hello, all You peopl'
>>> seq = difflib.SequenceMatcher(a=a.lower(), b=b.lower())
>>> seq.ratio()
0.97560975609756095

You can wrap this in a function that treats two strings as similar when their ratio exceeds 0.9:

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

>>> similar(a, b)
True
>>> similar('Hello, world', 'Hi, world')
False
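For anyone wondering where that number comes from: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings. Here is a minimal sketch that recomputes it from the matching blocks. The helper name ratio_by_hand is made up for illustration and is not part of difflib.

```python
import difflib


def ratio_by_hand(s1, s2):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(a=s1, b=s2)
    # Each block is (a_start, b_start, size); the last one is a
    # zero-length sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(s1) + len(s2)
    # difflib itself returns 1.0 when both sequences are empty.
    return 2.0 * matched / total if total else 1.0


if __name__ == "__main__":
    x = 'hello, all you people'
    y = 'hello, all you peopl'
    print(ratio_by_hand(x, y))
    print(difflib.SequenceMatcher(a=x, b=y).ratio())
```

Both calls should print the same value, since 20 of the 41 total characters match on each side.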


There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

Hamming distance
Levenshtein distance
Needleman-Wunsch distance or Sellers Algorithm
and many more...
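To give an idea of how little code the first two take, here is a rough pure-Python sketch of Hamming and Levenshtein distance. This is not adapted from the Sheffield implementations. It is just the textbook definitions, and the function names hamming and levenshtein are my own.

```python
def hamming(s1, s2):
    """Count positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    # Only the previous row of the dynamic-programming table is kept.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming('karolin', 'kathrin'))
    print(levenshtein('kitten', 'sitting'))
```

The Levenshtein version runs in O(len(s1) * len(s2)) time, so for large inputs a C-backed library is likely a better choice.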


This snippet calculates the difflib, Levenshtein, Sørensen, and Jaccard similarity values for two strings. It iterates over a TSV file in which the strings of interest occupy columns [3] and [4]. It needs two third-party packages (pip install python-Levenshtein and pip install distance):

import codecs, difflib, Levenshtein, distance

with codecs.open("titles.tsv", "r", "utf-8") as f:
    # Drop the empty entry left by the trailing newline
    title_list = f.read().split("\n")[:-1]
    for row in title_list:
        sr = row.lower().split("\t")
        diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio()
        lev = Levenshtein.ratio(sr[3], sr[4])
        # distance returns dissimilarities, so subtract from 1
        sor = 1 - distance.sorensen(sr[3], sr[4])
        jac = 1 - distance.jaccard(sr[3], sr[4])
        print(diffl, lev, sor, jac)
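If you would rather not install the distance package, the last two measures are short enough to write by hand. The sketch below assumes the usual set-based definitions applied to the characters of each string. I have not checked it against the package's source, so treat the function names and the empty-string handling as my own assumptions.

```python
def jaccard_similarity(s1, s2):
    """|A & B| / |A | B| over the sets of characters in each string."""
    set1, set2 = set(s1), set(s2)
    union = set1 | set2
    if not union:
        # Assumption: two empty strings count as identical.
        return 1.0
    return len(set1 & set2) / len(union)


def sorensen_similarity(s1, s2):
    """2 * |A & B| / (|A| + |B|) over the sets of characters."""
    set1, set2 = set(s1), set(s2)
    total = len(set1) + len(set2)
    if not total:
        # Assumption: two empty strings count as identical.
        return 1.0
    return 2 * len(set1 & set2) / total


if __name__ == "__main__":
    print(jaccard_similarity('abc', 'bcd'))
    print(sorensen_similarity('abc', 'bcd'))
```

Note that both work on sets, so character order and repetition are ignored: 'abc' and 'cba' come out as identical, which is not the case for difflib or Levenshtein.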
