Python: optimal search for substring in list of strings


For the sort of thing you're trying (searching for a fixed set of a whole bunch of strings in a whole bunch of other strings), parallelizing and minor tweaks won't help much. You need algorithmic improvements.

For a start, I'd suggest using the Aho-Corasick string matching algorithm. Basically, in exchange for some precompute work to build a matcher object from your set of fixed strings, you can scan another string for all of those fixed strings at once, in a single pass.

So instead of scanning 60K strings 50K+ times each (three BILLION scans?!?), you can scan them each once with only slightly higher cost than a normal single scan, and get all the hits.

Best part is, you're not writing it yourself. PyPI (the Python package index) already has the pyahocorasick package written for you. So try it out.

Example of use (note that the PyPI package is named pyahocorasick, but you import it as ahocorasick):

import ahocorasick

listStrings = ['ACDE', 'CDDE', 'BPLL', ...]
listSubstrings = ['ACD', 'BPI', 'KLJ', ...]

auto = ahocorasick.Automaton()
for substr in listSubstrings:
    auto.add_word(substr, substr)
auto.make_automaton()

...

for astr in listStrings:
    for end_ind, found in auto.iter(astr):
        w.write(found+astr)

This will write multiple times if a substring ("needle") is found in the string being searched ("haystack") more than once. You could change the loop to only write on the first hit for a given needle in a given haystack by using a set to dedup:

for astr in listStrings:
    seen = set()
    for end_ind, found in auto.iter(astr):
        if found not in seen:
            seen.add(found)
            w.write(found+astr)

You can further tweak this to output the needles for a given haystack in the same order they appeared in listSubstrings (and uniquify while you're at it) by storing the index of the words as, or with, their values so you can sort the hits (presumably small numbers, so sort overhead is trivial):

from future_builtins import map  # Only on Py2, for more efficient generator based map
from itertools import groupby
from operator import itemgetter

auto = ahocorasick.Automaton()
for i, substr in enumerate(listSubstrings):
    # Store index and substr so we can recover original ordering
    auto.add_word(substr, (i, substr))
auto.make_automaton()

...

for astr in listStrings:
    # Gets all hits, sorting by the index in listSubstrings, so we output hits
    # in the same order we theoretically searched for them
    allfound = sorted(map(itemgetter(1), auto.iter(astr)))
    # Using groupby dedups already sorted inputs cheaply; the map throws away
    # the index since we don't need it
    for found, _ in groupby(map(itemgetter(1), allfound)):
        w.write(found+astr)

For performance comparisons, I used a variant on mgc's answer that is more likely to contain matches, as well as enlarging the haystacks. First, setup code:

>>> from random import choice, randint
>>> from string import ascii_uppercase as uppercase
>>> # 5000 haystacks, each 1000-5000 characters long
>>> listStrings = [''.join([choice(uppercase) for i in range(randint(1000, 5000))]) for j in range(5000)]
>>> # ~1000 needles (might be slightly less for dups), each 3-12 characters long
>>> listSubstrings = tuple({''.join([choice(uppercase) for i in range(randint(3, 12))]) for j in range(1000)})
>>> auto = ahocorasick.Automaton()
>>> for needle in listSubstrings:
...     auto.add_word(needle, needle)
...
>>> auto.make_automaton()

And now to actually test it (using IPython %timeit magic for microbenchmarks):

>>> sum(needle in haystack for haystack in listStrings for needle in listSubstrings)
80279  # Will differ depending on random seed
>>> sum(len(set(map(itemgetter(1), auto.iter(haystack)))) for haystack in listStrings)
80279  # Same behavior after uniquifying results
>>> %timeit -r5 sum(needle in haystack for haystack in listStrings for needle in listSubstrings)
1 loops, best of 5: 9.79 s per loop
>>> %timeit -r5 sum(len(set(map(itemgetter(1), auto.iter(haystack)))) for haystack in listStrings)
1 loops, best of 5: 460 ms per loop

So for checking for ~1000 smallish strings in each of 5000 moderate size strings, pyahocorasick beats individual membership tests by a factor of ~21x on my machine. It scales well as the size of listSubstrings increases too; when I initialized it the same way, but with 10,000 smallish strings instead of 1000, the total time required increased from ~460 ms to ~852 ms, a 1.85x time multiplier to perform 10x as many logical searches.

For the record, the time to build the automatons is trivial in this sort of context. You pay it once up front, not once per haystack, and testing shows the ~1000 string automaton took ~1.4 ms to build and occupied ~277 KB of memory (above and beyond the strings themselves); the ~10,000 string automaton took ~21 ms to build and occupied ~2.45 MB of memory.
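If you want to sanity-check those build costs on your own data, here is a minimal sketch of how they could be measured. The measure_build helper is hypothetical (not part of the answer above), and it uses the stdlib time and tracemalloc modules; tracemalloc may undercount allocations made inside the C extension, so treat the memory figure as a rough lower bound.

import time
import tracemalloc
import ahocorasick

def measure_build(needles):
    # Hypothetical helper: time the automaton build and roughly track
    # how much memory Python-level allocations used while doing it.
    tracemalloc.start()
    start = time.perf_counter()

    auto = ahocorasick.Automaton()
    for needle in needles:
        auto.add_word(needle, needle)
    auto.make_automaton()

    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return auto, elapsed, peak

auto, build_time, approx_mem = measure_build(listSubstrings)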


Maybe you can try to chunk one of the two lists (the biggest one? Although intuitively I would cut listStrings) into smaller ones, then run those searches in parallel (the Pool class from multiprocessing offers a convenient way to do this). I had some significant speed-up using something like:

from multiprocessing import Pool
from itertools import chain, islice

# The function to be run in parallel:
def my_func(strings):
    return [j+i for i in strings for j in listSubstrings if i.find(j) > -1]

# A small recipe from itertools to chunk an iterable:
def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

# Generating some fake & random values:
from random import randint
listStrings = \
    [''.join([chr(randint(65, 90)) for i in range(randint(1, 500))]) for j in range(10000)]
listSubstrings = \
    [''.join([chr(randint(65, 90)) for i in range(randint(1, 100))]) for j in range(1000)]

# You have to prepare the searches to be performed:
prep = [strings for strings in chunk(listStrings, round(len(listStrings) / 8))]

with Pool(4) as mp_pool:
    # Pool.map is a parallel version of map()
    res = mp_pool.map(my_func, prep)

# The `res` variable is a list of lists, so concatenate them
# to get a flat result list
result = list(chain.from_iterable(res))

Then you could write the whole result variable (instead of writing it line by line):

with open('result_file', 'w') as f:
    f.write('\n'.join(result))

Edit 01/05/18: flattened the result using itertools.chain.from_iterable instead of an ugly workaround using map side-effects, following ShadowRanger's advice.


Are your substrings all the same length? Your example uses 3-letter substrings. In that case, you could create a dict with 3-letter substrings as keys to a list of strings:

index = {}
for string in listStrings:
    for i in range(len(string)-2):
        substring = string[i:i+3]
        index_strings = index.get(substring, [])
        index_strings.append(string)
        index[substring] = index_strings

for substring in listSubstrings:
    index_strings = index.get(substring, [])
    for string in index_strings:
        w.write(substring+string)
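If the substrings are not all the same length, a possible extension of the same indexing idea (just a sketch, not part of the original approach, and assuming every needle is at least 3 characters long) is to index each haystack by its distinct 3-grams, then use the first 3 characters of each needle to find candidate haystacks and confirm with a normal substring check:

from collections import defaultdict

# Index every haystack by each distinct 3-gram it contains
# (assumes every needle is at least 3 characters long).
index = defaultdict(list)
for string in listStrings:
    for gram in {string[i:i+3] for i in range(len(string) - 2)}:
        index[gram].append(string)

for substring in listSubstrings:
    # Candidate haystacks share the needle's first 3 characters;
    # confirm the full needle actually occurs before writing.
    for string in index.get(substring[:3], []):
        if substring in string:
            w.write(substring + string)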