
Good Python modules for fuzzy string comparison? [closed]


The standard library's difflib module can do it.

Example from the docs:

    >>> from difflib import get_close_matches
    >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    ['apple', 'ape']
    >>> import keyword
    >>> get_close_matches('wheel', keyword.kwlist)
    ['while']
    >>> get_close_matches('apple', keyword.kwlist)
    []
    >>> get_close_matches('accept', keyword.kwlist)
    ['except']

Check it out. The module has other tools, such as SequenceMatcher, that can help you build something custom.
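As a rough illustration of "something custom", here is a minimal sketch that scores every candidate with SequenceMatcher and returns the scores alongside the matches. The function name `ranked_matches`, the case-insensitive comparison, and the default cutoff of 0.6 are assumptions made for this example, not part of difflib's API.

```python
from difflib import SequenceMatcher


def ranked_matches(word, candidates, cutoff=0.6):
    """Return (score, candidate) pairs scoring at least `cutoff`, best first.

    Hypothetical helper for illustration: unlike get_close_matches, it
    compares case-insensitively and keeps the similarity score.
    """
    scored = []
    for candidate in candidates:
        # ratio() is a float in [0.0, 1.0]; higher means more similar.
        score = SequenceMatcher(None, word.lower(), candidate.lower()).ratio()
        if score >= cutoff:
            scored.append((score, candidate))
    # Sort by score, highest first (ties fall back to the candidate string).
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, candidate in ranked_matches("appel", ["ape", "apple", "peach", "puppy"]):
        print(f"{candidate}: {score:.2f}")
```

On the word list from the docs example this ranks 'apple' ahead of 'ape' and drops the other two, which agrees with what get_close_matches returns there.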


Levenshtein Python extension and C library.

https://github.com/ztane/python-Levenshtein/

The Levenshtein Python C extension module contains functions for fast computation of

- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.

    $ pip install python-levenshtein
    ...
    $ python
    >>> import Levenshtein
    >>> help(Levenshtein.ratio)
    ratio(...)
        Compute similarity of two strings.

        ratio(string1, string2)

        The similarity is a number between 0 and 1, it's usually equal or
        somewhat higher than difflib.SequenceMatcher.ratio(), because it's
        based on real minimal edit distance.

        Examples:
        >>> ratio('Hello world!', 'Holly grail!')
        0.58333333333333337
        >>> ratio('Brian', 'Jesus')
        0.0

    >>> help(Levenshtein.distance)
    distance(...)
        Compute absolute Levenshtein distance of two strings.

        distance(string1, string2)

        Examples (it's hard to spell Levenshtein correctly):
        >>> distance('Levenshtein', 'Lenvinsten')
        4
        >>> distance('Levenshtein', 'Levensthein')
        2
        >>> distance('Levenshtein', 'Levenshten')
        1
        >>> distance('Levenshtein', 'Levenshtein')
        0
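If you want to see what distance() measures without installing the C extension, the classic dynamic-programming formulation of edit distance fits in a few lines of pure Python. This is only a sketch: the function name `levenshtein_distance` is made up for this example, it is not how the extension is implemented, and it will be much slower on long strings.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`.

    Hypothetical pure-Python helper for illustration only.
    """
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for misspelling in ("Lenvinsten", "Levensthein", "Levenshten", "Levenshtein"):
        print(misspelling, levenshtein_distance("Levenshtein", misspelling))
```

Only two rows of the table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.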


As nosklo said, use the difflib module from the Python standard library.

The difflib module can return a measure of the sequences' similarity using the ratio() method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.

    >>> import difflib
    >>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
    1.0
    >>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
    0.80000000000000004
    >>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
    0.0
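To see where those numbers come from: the difflib documentation defines ratio() as 2.0*M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes that value from get_matching_blocks(); the helper name `ratio_by_hand` is an assumption made for this example.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2.0 * M / T.

    Hypothetical helper for illustration only.
    """
    matcher = SequenceMatcher(None, a, b)
    # Each block is a (a, b, size) triple; the final dummy block has size 0.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty sequences are treated as identical.
    return 2.0 * matched / total if total else 1.0


if __name__ == "__main__":
    for other in ("abcde", "zbcde", "zyzzy"):
        print(other, ratio_by_hand("abcde", other))
```

For 'abcde' against 'zbcde', four characters match out of ten in total, giving 2 * 4 / 10 = 0.8.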