What is the fastest substring search algorithm?


Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.
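As a rough illustration of that workflow, here is a minimal profiling harness in Python. The sample cases, the `profile` helper, and the two candidate functions are all placeholders I made up for the sketch; swap in your own algorithms and data.

```python
import timeit

def brute_force(haystack: str, needle: str) -> int:
    """Naive search: try every alignment; return the first index or -1."""
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1

def builtin_find(haystack: str, needle: str) -> int:
    """Baseline: the language's built-in search."""
    return haystack.find(needle)

def profile(algorithms, cases, repeat=5):
    """Return {name: best total seconds over all (haystack, needle) cases}."""
    results = {}
    for name, search in algorithms.items():
        def run_all(search=search):
            for haystack, needle in cases:
                search(haystack, needle)
        # Take the minimum of several runs to reduce timing noise.
        results[name] = min(timeit.repeat(run_all, number=1, repeat=repeat))
    return results

if __name__ == "__main__":
    # Hypothetical sample data; replace with needles and haystacks from your domain.
    cases = [
        ("the quick brown fox jumps over the lazy dog " * 200, "lazy cat"),
        ("ACGT" * 5000 + "ACGA", "ACGA"),
        ("short text", "x"),
    ]
    algorithms = {"brute_force": brute_force, "builtin_find": builtin_find}
    timings = profile(algorithms, cases)
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds:.6f} s")
```

The absolute numbers mean little on their own; what matters is the ranking on data that resembles what you will actually search.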

Boyer-Moore uses a bad character table with a good suffix table.

Boyer-Moore-Horspool uses a bad character table.

Knuth-Morris-Pratt uses a partial match table.

Rabin-Karp uses running hashes.
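To make those preprocessing structures concrete, here is a small Python sketch of the Horspool bad character table, the KMP partial match table, and a Rabin-Karp search with a running hash. The function names and the hash parameters (`base`, `mod`) are my own choices for illustration, not taken from any particular library.

```python
def bad_character_table(needle: str) -> dict:
    """Horspool shift table: for each character except the last one, the
    distance from its last occurrence to the end of the needle.
    Characters missing from the table shift by the full needle length."""
    m = len(needle)
    return {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}

def partial_match_table(needle: str) -> list:
    """KMP failure table: table[i] is the length of the longest proper
    prefix of needle[:i + 1] that is also a suffix of it."""
    table = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = table[k - 1]  # fall back to the next shorter border
        if needle[i] == needle[k]:
            k += 1
        table[i] = k
    return table

def rabin_karp(haystack: str, needle: str,
               base: int = 256, mod: int = 1_000_000_007) -> int:
    """Compare a rolling hash of each window with the needle's hash and
    verify by direct comparison on a hash match. Returns an index or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    target = window = 0
    for i in range(m):
        target = (target * base + ord(needle[i])) % mod
        window = (window * base + ord(haystack[i])) % mod
    for i in range(n - m + 1):
        if window == target and haystack[i:i + m] == needle:
            return i
        if i + m < n:
            # Slide the window: drop haystack[i], append haystack[i + m].
            window = ((window - ord(haystack[i]) * high) * base
                      + ord(haystack[i + m])) % mod
    return -1
```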

They all trade preprocessing overhead for fewer comparisons, each to a different degree, so the real-world performance will depend on the average lengths of both the needle and haystack. The more initial overhead an algorithm has, the longer the inputs need to be before that overhead pays off. With very short needles, brute force may win.
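One way to see the trade is to count character comparisons instead of measuring time. The sketch below does that for brute force and for Boyer-Moore-Horspool; both counting functions are my own illustrative helpers, and the example text is arbitrary.

```python
def brute_force_count(haystack: str, needle: str):
    """Return (index, character comparisons) for naive search."""
    n, m = len(haystack), len(needle)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons

def horspool_count(haystack: str, needle: str):
    """Return (index, character comparisons) for Boyer-Moore-Horspool."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0, 0
    # Preprocessing overhead: build the bad character shift table once.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    comparisons = 0
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0:  # compare right to left
            comparisons += 1
            if haystack[i + j] != needle[j]:
                break
            j -= 1
        if j < 0:
            return i, comparisons
        # Skip ahead based on the character under the needle's last position.
        i += shift.get(haystack[i + m - 1], m)
    return -1, comparisons

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    for needle in ("lazy dog", "z"):
        print(needle, brute_force_count(text, needle), horspool_count(text, needle))
```

With a one-character needle the shift table cannot skip anything, so the table construction is pure overhead; with a longer needle the skips start to matter.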

Edit:

A different algorithm might be best for finding base pairs, English phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.

Think about the following little table. Each question mark might have a different best search algorithm.

                 short needle     long needle
short haystack         ?               ?
long haystack          ?               ?

This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.

If I needed a sample set, I think I would scrape a site like Google or Wikipedia, then strip the HTML from all the result pages. For a search site, type in a word, then use one of the suggested search phrases. Choose a few different languages, if applicable. Web pages would all give short to medium texts, so merge enough pages to get longer ones. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real-world samples if possible.
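For the random-content fallback, something like the following Python sketch would do. The helper names and the tiny word list are made up for illustration; in practice you would load a real dictionary file or, better, real texts.

```python
import random

def random_haystack(words, target_length, seed=None):
    """Join randomly chosen words with spaces, then trim the result to
    exactly target_length characters."""
    rng = random.Random(seed)
    parts, length = [], 0
    while length <= target_length:
        word = rng.choice(words)
        parts.append(word)
        length += len(word) + 1  # +1 for the separating space
    return " ".join(parts)[:target_length]

def sample_needles(haystack, needle_length, count, seed=None):
    """Cut needles out of the haystack so each one is guaranteed to occur.
    Assumes needle_length <= len(haystack)."""
    rng = random.Random(seed)
    last_start = len(haystack) - needle_length
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [haystack[s:s + needle_length] for s in starts]

if __name__ == "__main__":
    words = ["needle", "haystack", "search", "pattern", "text", "match"]
    haystack = random_haystack(words, 2 ** 10, seed=1)
    needles = sample_needles(haystack, 8, 3, seed=2)
    print(len(haystack), needles)
```

Fixing the seed keeps the test set reproducible between profiling runs. You would also want some needles that do not occur at all, since a failed search is often the slowest case.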

I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.


I believe it may very well be the "Simple Real-Time Constant-Space String Matching" algorithm by Dany Breslauer, Roberto Grossi, and Filippo Mignosi, published in 2011.

Update:

In 2014 the authors published this improvement: Towards optimal packed string matching.


The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

Solutions to most search problems involve trade-offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.

If your objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say. If you want to develop a generalized string searching service routine, then try the following:

Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front-end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades badly on others. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases, whereupon the Sustik-Moore algorithm may become more efficient (over small alphabets); then, for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.
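A minimal Python sketch of such a selector might look like this. The threshold, the `classify` rule, and the choice of brute force and Horspool as the two strategies are stand-ins I picked to keep the example short; a real selector would derive its rules from benchmark results and would include whichever algorithms your review turned up.

```python
def brute_force(haystack: str, needle: str) -> int:
    """Naive search: try every alignment; return the first index or -1."""
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return i
    return -1

def horspool(haystack: str, needle: str) -> int:
    """Boyer-Moore-Horspool: skip ahead using a bad character table."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        i += shift.get(haystack[i + m - 1], m)
    return -1

# Hypothetical threshold; take the real value from your own benchmarks.
SHORT_NEEDLE = 4

def classify(haystack: str, needle: str) -> str:
    """Pick a strategy name from cheap features of the input."""
    if len(needle) < SHORT_NEEDLE or len(haystack) < 2 * len(needle):
        return "brute_force"  # too little room for preprocessing to pay off
    return "horspool"

STRATEGIES = {"brute_force": brute_force, "horspool": horspool}

def search(haystack: str, needle: str) -> int:
    """Front end: classify the input, then dispatch to the chosen algorithm."""
    return STRATEGIES[classify(haystack, needle)](haystack, needle)
```

The classifier has to stay cheap, otherwise it eats the gain it is supposed to deliver. Features such as needle length, haystack length, and alphabet size cost almost nothing to compute.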

The multiple algorithm approach is not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT, commonly used on mainframes, implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs).

Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

Benchmark your service to identify the areas where additional search strategies are needed, or to tune your selector function more effectively. This approach is not quick or easy, but if done well it can produce very good results.