Search for string allowing for one mismatch in any location of the string

python string pattern-matching string-matching dna-sequence

Before you read on, have you looked at biopython?

It appears that you want to find approximate matches with one substitution error, and zero insertion/deletion errors i.e. a Hamming distance of 1.

If you have a Hamming distance match function (see e.g. the link provided by Ignacio), you could use it like this to do a search for the first match:

any(Hamming_distance(genome[x:x+25], sequence) == 1 for x in xrange(len(genome)))

but this would be rather slow, because (1) the Hamming distance function would keep on grinding after the 2nd substitution error (2) after failure, it advances the cursor by one rather than skipping ahead based on what it saw (like a Boyer-Moore search does).

You can overcome (1) with a function like this:

def Hamming_check_0_or_1(genome, posn, sequence):    errors = 0    for i in xrange(25):        if genome[posn+i] != sequence[i]:            errors += 1            if errors >= 2:                return errors    return errors

Note: that's intentionally not Pythonic, it's Cic, because you'd need to use C (perhaps via Cython) to get reasonable speed.

Some work on bit-parallel approximate Levenshtein searches with skipping has been done by Navarro and Raffinot (google "Navarro Raffinot nrgrep") and this could be adapted to Hamming searches. Note that bit-parallel methods have limitations on length of query string and alphabet size but yours are 25 and 4 respectively so no problems there. Update: skipping probably not much help with an alphabet size of 4.

When you google for Hamming distance search, you will notice lots of stuff about implementing it in hardware, and not much in software. This is a big hint that maybe whatever algorithm you come up with ought to be implemented in C or some other compiled language.

Update: Working code for a bit-parallel method

I've also supplied a simplistic method for helping with the correctness checking, and I've packaged a variation of Paul's re code for some comparisons. Note that using re.finditer() delivers non-overlapping results, and this can cause a distance-1 match to shadow an exact match; see my last test case.

The bit-parallel method has these features: guaranteed linear behaviour O(N) where N is text length. Note naive method is O(NM) as is the regex method (M is the pattern length). A Boyer-Moore-style method would be worst case O(NM) and expected O(N). Also the bit-parallel method can be used easily when input has to be buffered: it can be fed a byte or a megabyte at a time; no look-ahead, no problems with buffer boundaries. The big advantage: the speed of that simple per-input-byte code when coded in C.

Downsides: the pattern length is effectively limited to the number of bits in a fast register e.g. 32 or 64. In this case the pattern length is 25; no problem. It uses extra memory (S_table) proportional to the number of distinct characters in the pattern. In this case, the "alphabet size" is only 4; no problem.

Details from this technical report. The algorithm there is for approximate search usin Levenshtein distance. To convert to using Hamming distance, I simply (!) removed the pieces of statement 2.1 that handle insertion and deletion. You'll notice lots of reference to "R" with a "d" superscript. "d" is distance. We need only 0 and 1. These "R"s become the R0 and R1 variables in the code below.

# coding: asciifrom collections import defaultdictimport re_DEBUG = 0# "Fast Text Searching with Errors" by Sun Wu and Udi Manber# TR 91-11, Dept of Computer Science, University of Arizona, June 1991.# http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8854def WM_approx_Ham1_search(pattern, text):    """Generate (Hamming_dist, start_offset)    for matches with distance 0 or 1"""    m = len(pattern)    S_table = defaultdict(int)    for i, c in enumerate(pattern):        S_table[c] |= 1 << i    R0 = 0    R1 = 0    mask = 1 << (m - 1)    for j, c in enumerate(text):        S = S_table[c]        shR0 = (R0 << 1) | 1        R0 = shR0 & S        R1 = ((R1 << 1) | 1) & S | shR0        if _DEBUG:            print "j= %2d msk=%s S=%s R0=%s R1=%s" \                % tuple([j] + map(bitstr, [mask, S, R0, R1]))        if R0 & mask: # exact match            yield 0, j - m + 1        elif R1 & mask: # match with one substitution            yield 1, j - m + 1if _DEBUG:    def bitstr(num, mlen=8):       wstr = ""       for i in xrange(mlen):          if num & 1:             wstr = "1" + wstr          else:             wstr = "0" + wstr          num >>= 1       return wstrdef Ham_dist(s1, s2):    """Calculate Hamming distance between 2 sequences."""    assert len(s1) == len(s2)    return sum(c1 != c2 for c1, c2 in zip(s1, s2))def long_check(pattern, text):    """Naively and understandably generate (Hamming_dist, start_offset)    for matches with distance 0 or 1"""    m = len(pattern)    for i in xrange(len(text) - m + 1):        d = Ham_dist(pattern, text[i:i+m])        if d < 2:            yield d, idef Paul_McGuire_regex(pattern, text):    searchSeqREStr = (        '('        + pattern        + ')|('        + ')|('.join(            pattern[:i]            + "[ACTGN]".replace(c,'')            + pattern[i+1:]            for i,c in enumerate(pattern)            )        + ')'        )    searchSeqRE = re.compile(searchSeqREStr)    for match in searchSeqRE.finditer(text):        locn = match.start()        dist = int(bool(match.lastindex - 1))        yield dist, locnif __name__ == "__main__":    genome1 = "TTTACGTAAACTAAACTGTAA"    #         01234567890123456789012345    #                   1         2    tests = [        (genome1, "ACGT ATGT ACTA ATCG TTTT ATTA TTTA"),        ("T" * 10, "TTTT"),        ("ACGTCGTAAAA", "TCGT"), # partial match can shadow an exact match        ]    nfailed = 0    for genome, patterns in tests:        print "genome:", genome        for pattern in patterns.split():            print pattern            a1 = list(WM_approx_Ham1_search(pattern, genome))            a2 = list(long_check(pattern, genome))            a3 = list(Paul_McGuire_regex(pattern, genome))            print a1            print a2            print a3            print a1 == a2, a2 == a3            nfailed += (a1 != a2 or a2 != a3)    print "***", nfailed

python string pattern-matching string-matching dna-sequence

Python regex library supports fuzzy regular expression matching. One advantage over TRE is that it allows to find all matches of regular expression in the text (supports overlapping matches as well).

import regexm=regex.findall("AA", "CAG")>>> []m=regex.findall("(AA){e<=1}", "CAAG") # means allow up to 1 errorm>>> ['CA', 'AG']

python string pattern-matching string-matching dna-sequence

I googled for "toxoplasma gondii parasite genome" to find some of these genome files online. I found what I think was close, a file titled "TgondiiGenomic_ToxoDB-6.0.fasta" at http://toxodb.org, about 158Mb in size. I used the following pyparsing expression to extract the gene sequences, it took just under 2 minutes:

fname = "TgondiiGenomic_ToxoDB-6.0.fasta"fastasrc = open(fname).read()   # yes! just read the whole dang 158Mb!"""Sample header:>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448"""integer = Word(nums).setParseAction(lambda t:int(t[0]))genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") +                 "length=" + integer("genelen") + LineEnd() +                 Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))# read gene data from .fasta file - takes just under a couple of minutesgenedata = OneOrMore(genebit).parseString(fastasrc)

(Surprise! some of the gene sequences include runs of 'N's! What the heck is that about?!)

Then I wrote this class as a subclass of the pyparsing Token class, for doing close matches:

class CloseMatch(Token):    def __init__(self, seq, maxMismatches=1):        super(CloseMatch,self).__init__()        self.name = seq        self.sequence = seq        self.maxMismatches = maxMismatches        self.errmsg = "Expected " + self.sequence        self.mayIndexError = False        self.mayReturnEmpty = False    def parseImpl( self, instring, loc, doActions=True ):        start = loc        instrlen = len(instring)        maxloc = start + len(self.sequence)        if maxloc <= instrlen:            seq = self.sequence            seqloc = 0            mismatches = []            throwException = False            done = False            while loc < maxloc and not done:                if instring[loc] != seq[seqloc]:                    mismatches.append(seqloc)                    if len(mismatches) > self.maxMismatches:                        throwException = True                        done = True                loc += 1                seqloc += 1        else:            throwException = True        if throwException:            exc = self.myException            exc.loc = loc            exc.pstr = instring            raise exc        return loc, (instring[start:loc],mismatches)

For every match, this will return a tuple containing the actual string that was matched, and a list of the mismatch locations. Exact matches would of course return an empty list for the second value. (I like this class, I think I'll add it to the next release of pyparsing.)

I then ran this code to search for "up-to-2-mismatch" matches in all of the sequences read from the .fasta file (recall that genedata is a sequence of ParseResults groups, each containing an id, an integer length, and a sequence string):

searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)for g in genedata:    print "%s (%d)" % (g.id, g.genelen)    print "-"*24    for t,startLoc,endLoc in searchseq.scanString(g.gene):        matched, mismatches = t[0]        print "MATCH:", searchseq.sequence        print "FOUND:", matched        if mismatches:            print "      ", ''.join(' ' if i not in mismatches else '*'                             for i,c in enumerate(searchseq.sequence))        else:            print "<exact match>"        print "at location", startLoc        print    print

I took the search sequence at random from one of the gene bits, to be sure I could find an exact match, and just out of curiosity to see how many 1- and 2-element mismatches there were.

This took a little while to run. After 45 minutes, I had this output, listing each id and gene length, and any partial matches found:

scf_1104442825154 (964)------------------------scf_1104442822828 (942)------------------------scf_1104442824510 (987)------------------------scf_1104442823180 (1065)------------------------...

I was getting discouraged, not to see any matches until:

scf_1104442823952 (1188)------------------------MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAACGGAATCGAATGGAAT                *      *        at location 33MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCGAATGGAAT                       *        at location 175MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCGAATGGAAT                       *        at location 474MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCGAATGGAAT                       *        at location 617MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCGAATAGAAT                       *   *    at location 718MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGATTCGAATGGAAT                    *  *        at location 896MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCGAATGGTAT                       *     *  at location 945

And finally my exact match at:

scf_1104442823584 (1448)------------------------MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGACTCGAATGGAAT                    *  *        at location 177MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCAAATGGAAT                       *        at location 203MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCAAATGGAATCGAATGGAAT             *         *        at location 350MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCGAATGGAAA                       *       *at location 523MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCAAATGGAATCGAATGGAAT             *         *        at location 822MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCATCGAATGGAATCTAATGGAAT<exact match>at location 848MATCH: ATCATCGAATGGAATCTAATGGAATFOUND: ATCGTCGAATGGAGTCTAATGGAAT          *         *           at location 969

So while this didn't set any speed records, I got the job done, and found some 2-matches too, in case they might be of interest.

For comparison, here is an RE-based version, that finds 1-mismatch matches only:

import reseqStr = "ATCATCGAATGGAATCTAATGGAAT"searchSeqREStr = seqStr + '|' + \    '|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:]              for i,c in enumerate(seqStr))searchSeqRE = re.compile(searchSeqREStr)for g in genedata:    print "%s (%d)" % (g.id, g.genelen)    print "-"*24    for match in searchSeqRE.finditer(g.gene):        print "MATCH:", seqStr        print "FOUND:", match.group(0)        print "at location", match.start()        print    print

(At first, I tried searching the raw FASTA file source itself, but was puzzled why so few matches compared to the pyparsing version. Then I realized that some of the matches must cross the line breaks, since the fasta file output is wrapped at n characters.)

So after the first pyparsing pass to extract the gene sequences to match against, this RE-based searcher then took about another 1-1/2 minutes to scan all of the un-textwrapped sequences, to find all of the same 1-mismatch entries that the pyparsing solution did.

CodeHunter

Search for string allowing for one mismatch in any location of the string

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last