Why is mb_strpos() so considerably slower than strpos()? Why is mb_strpos() so considerably slower than strpos()? php php

Why is mb_strpos() so considerably slower than strpos()?


To understand why the functions have a different runtime you need to understand what they actually do. Because summing them up as ‘they search for needle in haystack’ isn’t enough.

strpos

If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for needle in haystack: Basically, it uses memchr to find the first byte of needle in haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of needle from the position of the previous match of the first byte.

Knowing this, we can say that strpos does only search for a byte sequence in a byte sequence using a naive search algorithm.

mb_strpos

This function is the multi-byte counterpart to strpos. This makes searching a little more complex as you can’t just look at the bytes without knowing to which character they belong to.

mb_strpos uses mbfl_strpos, which does a lot more in comparison to the simple algorithm of zend_memstr, it’s like 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).

We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.

First we have two setup loops and then there is the loop that proceeds the pointer according to the given offset where you can see that they aware of actual characters and how they skip whole encoded UTF-8 characters: since UTF-8 is a variable-width character encoding where the first byte of each encoded character denotes the whole length of the encoded character. This information is stored in the u8_tbl array.

Finally, the loop where the actual search happens. And here we have something interesting, because the test for needle at a certain position in haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for needle in haystack. This is actually an implementation of the Boyer–Moore string search algorithm.

So now we know that mb_strpos

  • converts the strings to UTF-8, if necessary
  • is aware of actual characters
  • uses the Boyer–Moore search algorithm

preg_match

As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match conducting a depth-first search of the pattern tree. This is basically a naive search approach.


I am leaving out preg_match to make the analysis more punctuated.

Taken your observation that mb_strpos is relatively slower compared to strpos, it leads you to the assumption that — because of the consumed time — mb_strpos does more than strpos.

I think this observation is correct.

You then asked what is that "more" that is causing the time difference.

I try to give a simple answer: That "more" is because strpos operates on binary strings (one character = 8 bit = 1 octet = 1 byte). mb_strpos operates on encoded character sequences (as nearly all of the mb_* functions do) which can be X bits, perhaps even in variable length per each character.

As this is always about a specific character encoding, both the haystack as well as the needle string (probably) need to be first validated for that encoding, and then the whole operation to find the string position needs to be done in that specific character encoding.

That is translation work and — depending on encoding — also requires a specific search algorithm.

Next to that the mb extension also needs to have some structures in memory to organize the different character encodings, be it translation tables and/or specific algorithms. See the extra parameter you inject — the name of the encoding for example.

That is by far more work than just doing simple byte-by-byte comparisons.

For example the GBK character encoding is pretty interesting when you need to encode or decode a certain character. The mb string function in this case needs to take all these specifics into account to find out if and at which position the character is. As PHP only has binary strings in the userland from which you would call that function, the whole work needs to be done on each single function call.

To illustrate this even more, if you look through the list of supported encodings (mb_list_encodings), you can also find some like BASE64, UUENCODE, HTML-ENTITIES and Quoted-Printable. As you might imagine, all these are handled differently.

For example a single numeric HTML entity can be up to 1024 bytes large, if not even larger. An extreme example I know and love is this one. However, for that encoding, it has to be handled by the mb_strpos algorithm.


Reason of slowness

Taking a look at the 5.5.6 PHP source files, the delay seems to arise for the most part in the mbfilter.c, where - as hakre surmised - both haystack and needle need to be validated and converted, every time mb_strpos (or, I guess, most of the mb_* family) gets called:

Unless haystack is in the default format, encode it to the default format:

if (haystack->no_encoding != mbfl_no_encoding_utf8) {        mbfl_string_init(&_haystack_u8);        haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);        if (haystack_u8 == NULL) {                result = -4;                goto out;        }} else {        haystack_u8 = haystack;}

Unless needle is in the default format, encode it to the default format:

if (needle->no_encoding != mbfl_no_encoding_utf8) {        mbfl_string_init(&_needle_u8);        needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);        if (needle_u8 == NULL) {                result = -4;                goto out;        }} else {        needle_u8 = needle;}

According to a quick check with valgrind, the encoding conversion accounts for a huge part of mb_strpos's runtime, about 84% of the total, or five-sixths:

218,552,085  ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]183,812,085  ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]

which appears to be consistent with the OP's timings of mb_strpos versus strpos.

Encoding not considered, mb_strpos'ing a string is exactly the same of strpos'ing a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then, you would get a delay by a factor of four, not by a factor of twenty. The additional 5-6X slowdown arises from encoding times.

Accelerating mb_strpos...

So what can you do? You can skip those two steps by ensuring that you have internally the strings already in the "basic" format in which mbfl* do conversion and compare, which is mbfl_no_encoding_utf8 (UTF-8):

  • Keep your data in UTF-8.
  • Convert user input to UTF-8 as soon as practical.
  • Convert, if necessary, back to client encoding if needed.

Then your pseudo-code:

$haystack = "...";$needle   = "...";$res = mb_strpos($haystack, $needle, 0, $Encoding);

becomes:

$haystack = "...";$needle   = "...";mb_internal_encoding('UTF-8') or die("Cannot set encoding");$haystack   = mb_convert_encoding($haystack, 'UTF-8' [, $SourceEncoding]);$needle     = mb_convert_encoding($needle, 'UTF-8', [, $SourceEncoding]);$res = mb_strpos($haystack, $needle, 0);

...when it's worth it

Of course this is only convenient if the "setup time" and maintenance of a whole UTF-8 base is appreciably smaller than the "run time" of doing conversions implicitly in every mb_* function.