
64-bit Linux performance issue with memset


I believe that virtualization is the culprit: I have been running some benchmarks of my own (random number generation in bulk, sequential searches; also 64-bit) and found that the code runs roughly 2x slower under Linux in VirtualBox than natively under Windows. The funny thing is that the code does no I/O (except a simple printf now and then, between timings) and uses little memory (all data fits into the L1 cache), so one would think that page-table management and TLB overheads could be excluded.

This is mysterious indeed. I have noticed that VirtualBox reports to the VM that the SSE 4.1 and SSE 4.2 instructions are not supported, even though the CPU supports them, and yet the program using them runs fine(!) in the VM. I have no time to investigate the issue further, but you REALLY should time it on a real machine. Unfortunately, my program is 64-bit only, so I could not check whether 32-bit mode shows the same slowdown.
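If you want to double-check what the VM actually exposes, you can query CPUID directly. A minimal sketch, assuming GCC on x86/x86_64 (the file name is mine):

    /* cpuidcheck.c -- report whether the CPU (or the VM's virtual CPU)
     * advertises SSE 4.1 / 4.2.  Build: gcc -o cpuidcheck cpuidcheck.c */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 1 not available\n");
            return 1;
        }
        printf("SSE4.1: %s\n", (ecx & bit_SSE4_1) ? "yes" : "no");
        printf("SSE4.2: %s\n", (ecx & bit_SSE4_2) ? "yes" : "no");
        return 0;
    }

Running this both natively and inside the VM should show whether VirtualBox really masks the feature bits.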


I can confirm that on my non-virtualized Mandriva Linux system the x86_64 version is slightly (about 7%) slower. In both cases the memset() library function is called, regardless of the instruction set word size.

A casual look at the assembly code of both library implementations reveals that the x86_64 version is significantly more complex. I assume that this has mostly to do with the fact that the 32-bit version has to deal with only 4 possible alignment cases, versus the 8 possible alignment cases of the 64-bit version. It also seems that the x86_64 memset() loop has been more extensively unrolled, perhaps due to different compiler optimizations.
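To make the alignment argument concrete, here is a simplified word-at-a-time memset (my own sketch, not the actual glibc code): the head and tail have to be filled byte by byte until the pointer is word-aligned, and with 8-byte words there are up to 7 residual bytes to deal with instead of 3.

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified word-at-a-time memset -- an illustration, not the
     * real glibc implementation. */
    void *my_memset(void *dst, int c, size_t n)
    {
        unsigned char *p = dst;
        uintptr_t word = (unsigned char)c;

        /* Replicate the fill byte across one machine word
         * (8 copies on x86_64, 4 on i386). */
        word |= word << 8;
        word |= word << 16;
    #if UINTPTR_MAX > 0xffffffffUL
        word |= word << 32;
    #endif

        /* Head: byte stores until p is word-aligned.  With 8-byte words
         * there are up to 7 leftover bytes here instead of 3 -- one
         * source of the extra complexity in the 64-bit version. */
        while (n > 0 && ((uintptr_t)p & (sizeof word - 1)) != 0) {
            *p++ = (unsigned char)c;
            n--;
        }

        /* Body: aligned full-word stores. */
        while (n >= sizeof word) {
            *(uintptr_t *)(void *)p = word;
            p += sizeof word;
            n -= sizeof word;
        }

        /* Tail: remaining bytes. */
        while (n-- > 0)
            *p++ = (unsigned char)c;

        return dst;
    }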

Another factor that could account for the slower operation is the extra memory traffic that comes with a 64-bit word size: both code and metadata (pointers, etc.) generally get larger in 64-bit applications.

Also, keep in mind that the library implementations included in most distributions are targeted at whatever CPU the maintainers consider the current lowest common denominator for each processor family. This may leave 64-bit processors at a disadvantage, since the 32-bit instruction set has been stable for some time now.


When compiling your example code, the compiler sees that the block size is fixed (~8 MB) and decides to call the library version. Try much smaller blocks (memset'ing just a few bytes) and compare the disassembly, for example:
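Something along these lines (hypothetical test functions of mine):

    #include <string.h>

    /* Compile with "gcc -O2 -S memtest.c" (and "-m32" for comparison)
     * and look at the generated memtest.s. */

    /* Small fixed size: typically expanded inline to a few stores. */
    void clear_small(char *buf)
    {
        memset(buf, 0, 16);
    }

    /* Large fixed size (~8 MB): typically compiled to a library call. */
    void clear_large(char *buf)
    {
        memset(buf, 0, 8 * 1024 * 1024);
    }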

That said, I do not know why the x64 version is slower. My guess is that there is an issue in your time-measurement code.
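If you want to rule that out, a monotonic-clock harness along these lines (sizes and names are mine) avoids the usual pitfalls: timer granularity, first-touch page faults, and the compiler discarding the stores.

    /* bench.c -- crude memset bandwidth test.
     * Build: gcc -O2 -o bench bench.c   (add -lrt on older glibc) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BLOCK_SIZE (8 * 1024 * 1024)
    #define ITERATIONS 100

    int main(void)
    {
        char *buf = malloc(BLOCK_SIZE);
        struct timespec t0, t1;
        unsigned sum = 0;
        int i;

        if (!buf)
            return 1;

        memset(buf, 0, BLOCK_SIZE);   /* touch the pages up front */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERATIONS; i++) {
            memset(buf, i & 0xff, BLOCK_SIZE);
            sum += (unsigned char)buf[BLOCK_SIZE - 1]; /* keep stores live */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("checksum %u, %.1f MB/s\n", sum,
               (double)BLOCK_SIZE * ITERATIONS / secs / 1e6);

        free(buf);
        return 0;
    }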

From the changelog of gcc 4.3:

Code generation of block move (memcpy) and block set (memset) was rewritten. GCC can now pick the best algorithm (loop, unrolled loop, instruction with rep prefix or a library call) based on the size of the block being copied and the CPU being optimized for. A new option -minline-stringops-dynamically has been added. With this option string operations of unknown size are expanded such that small blocks are copied by in-line code, while for large blocks a library call is used. This results in faster code than -minline-all-stringops when the library implementation is capable of using cache hierarchy hints. The heuristic choosing the particular algorithm can be overwritten via -mstringop-strategy. Newly also memset of values different from 0 is inlined.
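To see the heuristic at work, you can force each strategy and compare the assembly; the flags below are taken from the GCC manual (i386/x86-64 options, GCC 4.3 and later), and the file name is mine:

    /* strat.c -- the same call compiled with different strategies:
     *
     *   gcc -O2 -S strat.c                                # default heuristic
     *   gcc -O2 -S -minline-all-stringops strat.c         # always inline
     *   gcc -O2 -S -minline-stringops-dynamically strat.c # size-dependent
     *   gcc -O2 -S -mstringop-strategy=rep_byte strat.c   # force "rep stosb"
     *   gcc -O2 -S -mstringop-strategy=libcall strat.c    # force library call
     *
     * Compare the generated strat.s in each case. */
    #include <string.h>

    void fill(char *buf, size_t n)
    {
        memset(buf, 0, n);
    }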

Hope this explains what the compiler designers are trying to do (even if this is for another version) ;-)