Checking fuzzy/approximate substring existing in a longer string, in Python? Checking fuzzy/approximate substring existing in a longer string, in Python? python python

Checking fuzzy/approximate substring existing in a longer string, in Python?


How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib>>> large_string = "thelargemanhatanproject">>> query_string = "manhattan">>> s = difflib.SequenceMatcher(None, large_string, query_string)>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))0.8888888888888888>>> query_string = "banana">>> s = difflib.SequenceMatcher(None, large_string, query_string)>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))0.6666666666666666

UPDATE

import difflibdef matches(large_string, query_string, threshold):    words = large_string.split()    for word in words:        s = difflib.SequenceMatcher(None, word, query_string)        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)        if len(match) / float(len(query_string)) >= threshold:            yield matchlarge_string = "thelargemanhatanproject is a great project in themanhattincity"query_string = "manhattan"print list(matches(large_string, query_string, 0.8))

Above code print: ['manhatan', 'manhattn']


The new regex library that's soon supposed to replace re includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this would give you a match with one or fewer insertions/additions/deletions.

import regexregex.match('(amazing){e<=1}', 'amaging')


I use fuzzywuzzy to fuzzy match based on threshold and fuzzysearch to fuzzy extract words from the match.

process.extractBests takes a query, list of words and a cutoff score and returns a list of tuples of match and score above the cutoff score.

find_near_matches takes the result of process.extractBests and returns the start and end indices of words. I use the indices to build the words and use the built word to find the index in the large string. max_l_dist of find_near_matches is 'Levenshtein distance' which has to be adjusted to suit the needs.

from fuzzysearch import find_near_matchesfrom fuzzywuzzy import processlarge_string = "thelargemanhatanproject is a great project in themanhattincity"query_string = "manhattan"def fuzzy_extract(qs, ls, threshold):    '''fuzzy matches 'qs' in 'ls' and returns list of     tuples of (word,index)    '''    for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):        print('word {}'.format(word))        for match in find_near_matches(qs, word, max_l_dist=1):            match = word[match.start:match.end]            print('match {}'.format(match))            index = ls.find(match)            yield (match, index)

To test:

query_string = "manhattan"print('query: {}\nstring: {}'.format(query_string, large_string))for match,index in fuzzy_extract(query_string, large_string, 70):    print('match: {}\nindex: {}'.format(match, index))query_string = "citi"print('query: {}\nstring: {}'.format(query_string, large_string))for match,index in fuzzy_extract(query_string, large_string, 30):    print('match: {}\nindex: {}'.format(match, index))query_string = "greet"print('query: {}\nstring: {}'.format(query_string, large_string))for match,index in fuzzy_extract(query_string, large_string, 30):    print('match: {}\nindex: {}'.format(match, index))

Output:

query: manhattan  string: thelargemanhatanproject is a great project in themanhattincity  match: manhatan  index: 8  match: manhattin  index: 49  query: citi  string: thelargemanhatanproject is a great project in themanhattincity  match: city  index: 58  query: greet  string: thelargemanhatanproject is a great project in themanhattincity  match: great  index: 29