fuzzy string matching with grep


There used to be a tool called agrep for fuzzy regex matching, but it got abandoned.

http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.

https://github.com/Wikinaut/agrep looks like a revived open source release, but I have not tested it.

Failing that, see if you can find tre-agrep for your distro.


You can use tre-agrep and specify the maximum edit distance with the -E switch (or with the shorthand -0 through -9 for single-digit limits). For example, if you have a file foo:

cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF

You can match every line within an edit distance of 9 like this (-s prints the cost of each match, -9 allows up to 9 errors, and -w matches whole words only):

tre-agrep -s -9 -w ACTGGGTAAACTA foo

Output:

4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
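To see where those costs come from, here is a minimal Python sketch that computes the plain Levenshtein (edit) distance between the pattern and each whole line of foo. This is not how tre-agrep works internally (it does approximate regex matching, so its cost need not equal the whole-line distance in general); the levenshtein function is just an illustrative helper. For these three lines the two numbers should coincide, because each line is a single word.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


pattern = "ACTGGGTAAACTA"
lines = ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"]  # contents of foo

for line in lines:
    print(f"{levenshtein(pattern, line)}:{line}")
```

The first line needs four insertions (the extra AAAA), the second four deletions (the missing GGGT), and the third is an exact match.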


There's a Python library called fuzzysearch (that I wrote) which provides precisely the required functionality.

Here's some sample code that should work:

from fuzzysearch import find_near_matches

with open('path/to/file', 'r') as f:
    data = f.read()

# 1. search allowing up to 3 substitutions
matches = find_near_matches("ACTGGGTAAACTA", data, max_substitutions=3)

# 2. also allow insertions and deletions, i.e. allow an edit distance
#    a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)
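To make the difference between the two searches concrete, here is a rough pure-Python sketch of what "substitutions only" means: slide a window the length of the pattern over the text and count mismatching positions. The function name find_substitution_matches is made up for illustration and this is not how fuzzysearch is implemented; it only shows the idea. Allowing insertions and deletions as well (the Levenshtein case) means the matched span may be shorter or longer than the pattern, which this naive window cannot handle.

```python
def find_substitution_matches(pattern, text, max_substitutions):
    """Yield (start, mismatches) for every window of text that differs
    from pattern in at most max_substitutions positions."""
    n = len(pattern)
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        # Count positions where the window and the pattern disagree.
        mismatches = sum(p != c for p, c in zip(pattern, window))
        if mismatches <= max_substitutions:
            yield start, mismatches


# Small example: an exact copy at index 2 and a copy with one substitution at index 8.
print(list(find_substitution_matches("ACTG", "TTACTGTTACAGTT", 1)))
# prints [(2, 0), (8, 1)]
```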