Fuzzy String Comparison


There is a package called fuzzywuzzy. Install via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler to use, fuzzywuzzy offers several matching methods (such as token-order-insensitive matching and partial string matching) that make it more powerful in practice. The process.extract functions are especially useful: they find the best-matching strings, along with their ratios, from a set of choices. From their readme:

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
100

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100

Process

>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
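
To give a feel for how token-order insensitivity can be layered on top of difflib, here is a rough sketch using only the standard library. It is not fuzzywuzzy's actual implementation: the names simple_ratio and token_sort_ratio_sketch are made up for illustration, and the normalization (lowercasing and splitting on whitespace) is an assumption.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer percentage (0-100)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio_sketch(a, b):
    """Compare two strings after sorting their lowercased, whitespace-separated tokens.

    Hypothetical helper: it only illustrates the idea of ignoring word order.
    """
    def normalize(s):
        return " ".join(sorted(s.lower().split()))

    return simple_ratio(normalize(a), normalize(b))


# Both strings normalize to "a bear fuzzy was wuzzy", so the score is 100.
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```

Sorting the tokens first means two strings with the same words in a different order become identical before the comparison, which is why the score jumps to 100.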


There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are after.

EDIT: A small example from the Python prompt:

>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
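
If you need to pick the best candidates from a list rather than score a single pair, difflib also provides get_close_matches. The sketch below is one possible way to use it; the choices list and the lowercasing step are illustrative assumptions (get_close_matches compares case-sensitively, so the lookup table maps lowercased strings back to the originals).

```python
from difflib import get_close_matches

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Map lowercased strings back to their original spelling,
# because get_close_matches does a case-sensitive comparison.
lowered = {c.lower(): c for c in choices}

# n caps the number of results; cutoff is the minimum ratio (0.0 to 1.0) to keep.
matches = get_close_matches("new york jets", list(lowered), n=2, cutoff=0.6)
best = [lowered[m] for m in matches]
print(best)  # ['New York Jets', 'New York Giants']
```

Results come back ordered from most to least similar, and anything scoring below the cutoff is dropped.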

HTH!


fuzzyset is much faster than fuzzywuzzy (which uses difflib) for both indexing and searching.

from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]

Warning: Be careful not to mix unicode and bytes in your fuzzyset.
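
One way to avoid mixing the two is to normalize everything to text before building the set. This is only a sketch: to_text is a hypothetical helper, and the UTF-8 default is an assumption about how your byte strings are encoded.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes with the given encoding.

    Hypothetical helper, not part of fuzzyset.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A corpus that accidentally mixes bytes and str entries.
raw_corpus = [b"It was a dark and stormy night.", "I had three cats."]

# After normalization every entry is str, so the list can be
# passed on safely, e.g. to FuzzySet(clean_corpus).
clean_corpus = [to_text(item) for item in raw_corpus]
print(clean_corpus)
```

Apply the same helper to the query string before calling get, so that both sides of the comparison have the same type.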