Checking fuzzy/approximate substring existing in a longer string, in Python?
How about using difflib.SequenceMatcher.get_matching_blocks
?
>>> import difflib>>> large_string = "thelargemanhatanproject">>> query_string = "manhattan">>> s = difflib.SequenceMatcher(None, large_string, query_string)>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))0.8888888888888888>>> query_string = "banana">>> s = difflib.SequenceMatcher(None, large_string, query_string)>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))0.6666666666666666
UPDATE
import difflibdef matches(large_string, query_string, threshold): words = large_string.split() for word in words: s = difflib.SequenceMatcher(None, word, query_string) match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n) if len(match) / float(len(query_string)) >= threshold: yield matchlarge_string = "thelargemanhatanproject is a great project in themanhattincity"query_string = "manhattan"print list(matches(large_string, query_string, 0.8))
Above code print: ['manhatan', 'manhattn']
The new regex library that's soon supposed to replace re includes fuzzy matching.
https://pypi.python.org/pypi/regex/
The fuzzy matching syntax looks fairly expressive, but this would give you a match with one or fewer insertions/additions/deletions.
import regexregex.match('(amazing){e<=1}', 'amaging')
I use fuzzywuzzy to fuzzy match based on threshold and fuzzysearch to fuzzy extract words from the match.
process.extractBests
takes a query, list of words and a cutoff score and returns a list of tuples of match and score above the cutoff score.
find_near_matches
takes the result of process.extractBests
and returns the start and end indices of words. I use the indices to build the words and use the built word to find the index in the large string. max_l_dist
of find_near_matches
is 'Levenshtein distance' which has to be adjusted to suit the needs.
from fuzzysearch import find_near_matchesfrom fuzzywuzzy import processlarge_string = "thelargemanhatanproject is a great project in themanhattincity"query_string = "manhattan"def fuzzy_extract(qs, ls, threshold): '''fuzzy matches 'qs' in 'ls' and returns list of tuples of (word,index) ''' for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold): print('word {}'.format(word)) for match in find_near_matches(qs, word, max_l_dist=1): match = word[match.start:match.end] print('match {}'.format(match)) index = ls.find(match) yield (match, index)
To test:
query_string = "manhattan"print('query: {}\nstring: {}'.format(query_string, large_string))for match,index in fuzzy_extract(query_string, large_string, 70): print('match: {}\nindex: {}'.format(match, index))query_string = "citi"print('query: {}\nstring: {}'.format(query_string, large_string))for match,index in fuzzy_extract(query_string, large_string, 30): print('match: {}\nindex: {}'.format(match, index))query_string = "greet"print('query: {}\nstring: {}'.format(query_string, large_string))for match,index in fuzzy_extract(query_string, large_string, 30): print('match: {}\nindex: {}'.format(match, index))
Output:
query: manhattan string: thelargemanhatanproject is a great project in themanhattincity match: manhatan index: 8 match: manhattin index: 49 query: citi string: thelargemanhatanproject is a great project in themanhattincity match: city index: 58 query: greet string: thelargemanhatanproject is a great project in themanhattincity match: great index: 29