Fuzzy String Comparison
There is a package called fuzzywuzzy. Install via pip:
pip install fuzzywuzzy
Simple usage:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96
The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler to use, fuzzywuzzy has a number of different matching methods (like token order insensitivity and partial string matching) which make it more powerful in practice. The process.extract functions are especially useful: they find the best-matching strings, along with their ratios, from a set of choices. From their readme:
Partial Ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
Token Sort Ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
90
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
Token Set Ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
Process
>>> from fuzzywuzzy import process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
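To make "token order insensitivity" concrete, here is a minimal sketch using only difflib. It is not fuzzywuzzy's actual implementation: simple_ratio and token_sort_ratio_sketch are hypothetical helper names, and the sketch assumes that lowercasing and splitting on whitespace is enough preprocessing for the comparison.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_sort_ratio_sketch(a, b):
    """Compare two strings after sorting their whitespace-separated tokens.

    Hypothetical helper for illustration; not the fuzzywuzzy code.
    """
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


# Same words, different order.
plain = simple_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
reordered = token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
print(plain, reordered)
```

Sorting the tokens first makes the two strings identical, so the sorted comparison scores 100 while the plain comparison stays below it.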
There is a module in the standard library (called difflib) that can compare strings and return a score based on their similarity. The SequenceMatcher class should do what you are after.
EDIT: Small example from the Python prompt:
>>> from difflib import SequenceMatcher as SM
>>> s1 = ' It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.'
>>> s2 = ' It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines.'
>>> SM(None, s1, s2).ratio()
0.9112903225806451
HTH!
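If you need to rank several candidates with SequenceMatcher rather than compare one pair, something like the following may work. It is only a sketch: best_matches is a hypothetical helper, and lowercasing both sides before comparing is an assumption about what counts as a match. The standard library also offers difflib.get_close_matches, which returns the matching strings without their scores.

```python
from difflib import SequenceMatcher


def best_matches(query, choices, limit=2):
    """Return up to `limit` (choice, score) pairs, best score first.

    Hypothetical helper; scores are SequenceMatcher ratios in [0, 1].
    """
    scored = [
        (choice, SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top = best_matches("new york jets", choices, limit=2)
for choice, score in top:
    print(choice, round(score, 2))
```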
fuzzyset is much faster than fuzzywuzzy (difflib) for both indexing and searching.
from fuzzyset import FuzzySet
corpus = """It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
    It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
    I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
    It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats."""
corpus = [line.lstrip() for line in corpus.split("\n")]
fs = FuzzySet(corpus)
query = "It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats."
fs.get(query)
# [(0.873015873015873, 'It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines')]
Warning: Be careful not to mix unicode and bytes in your fuzzyset.
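One way to avoid mixing the two is to normalize every entry to text before it goes into the set. The sketch below uses only built-ins; to_text is a hypothetical helper (not part of fuzzyset), and UTF-8 is an assumed encoding that you should replace with whatever your data actually uses.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is bytes.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A corpus that accidentally mixes str and bytes entries.
raw_entries = ["stormy night", b"crimson chair", "three felines"]
clean_entries = [to_text(entry) for entry in raw_entries]
print(clean_entries)
```

The cleaned list can then be passed to FuzzySet, and the same helper can be applied to each query string.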