How does memchr() work under the hood?
I would suggest taking a look at GNU libc's source. As for most functions, it will contain both a generic optimized C version of the function, and optimized assembly language versions for as many supported architectures as possible, taking advantage of machine specific tricks.
The x86-64 SSE2 version combines the results from pcmpeqb
on a whole cache-line of data at once (four 16B vectors), to amortize the overhead of the early-exit pmovmskb
/test
/jcc
.
gcc and clang are currently incapable of auto-vectorizing loops with if() break
early-exit conditions, so they make naive byte-at-a-time asm from the obvious C implementation.
This implementation of memchr from newlib is one example of someone's optimizing memchr:it's reading and testing 4 bytes at a time (apart from memchr, other functions in the newlib library are here).
Incidentally, most of the the source code for the MSVC run-time library is available, as an optional part of the MSVC installation (so, you could look at that).
Here is FreeBSD's (BSD-licensed) memchr() from memchr.c. FreeBSD's online source code browser is a good reference for time-tested, BSD-licensed code examples.
void *memchr(s, c, n) const void *s; unsigned char c; size_t n;{ if (n != 0) { const unsigned char *p = s; do { if (*p++ == c) return ((void *)(p - 1)); } while (--n != 0); } return (NULL);}