Why are memcpy() and memmove() faster than pointer increments?

c++ c loops

Because memcpy uses word pointers instead of byte pointers. In addition, memcpy implementations are often written with SIMD instructions, which makes it possible to move 128 bits at a time.

SIMD instructions are assembly instructions that perform the same operation on every element of a vector register. With SSE that vector is 16 bytes long; newer extensions such as AVX are wider. The SIMD instruction set includes load and store instructions.
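As a rough illustration of the 128-bit loads and stores mentioned above, here is a minimal sketch using the SSE2 intrinsics from `<emmintrin.h>`. The function name `sse2_memory_copy` is made up for this example, and it assumes an x86 or x86-64 compiler that defines `__SSE2__` when SSE2 is enabled; on other targets the sketch simply falls back to the byte loop. Real library implementations are considerably more elaborate.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */
#endif

/* Hypothetical helper: copies `bytes` bytes, 16 at a time where possible. */
void sse2_memory_copy(void *dst, const void *src, size_t bytes)
{
    unsigned char *b_dst = (unsigned char *)dst;
    const unsigned char *b_src = (const unsigned char *)src;

#if defined(__SSE2__)
    /* Main loop: one unaligned 128-bit load and one unaligned 128-bit store. */
    while (bytes >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)b_src);
        _mm_storeu_si128((__m128i *)b_dst, chunk);
        b_src += 16;
        b_dst += 16;
        bytes -= 16;
    }
#endif

    /* Tail (or the whole copy when SSE2 is unavailable): one byte at a time. */
    while (bytes > 0) {
        *b_dst++ = *b_src++;
        bytes--;
    }
}
```

The unaligned load/store variants are used so the sketch works for any pointers; a tuned routine would typically align one pointer first and use the aligned forms.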

Memory copy routines can be far more complicated and faster than a simple memory copy via pointers such as:

    #include <stddef.h>

    void simple_memory_copy(void* dst, const void* src, size_t bytes)
    {
      unsigned char* b_dst = (unsigned char*)dst;
      const unsigned char* b_src = (const unsigned char*)src;
      for (size_t i = 0; i < bytes; ++i)
        *b_dst++ = *b_src++;
    }

Improvements

The first improvement one can make is to align one of the pointers on a word boundary (by word I mean the native integer size, usually 32 bits/4 bytes, but it can be 64 bits/8 bytes on newer architectures) and use word-sized move/copy instructions. This requires a byte-by-byte copy until the pointer is aligned.

    #include <stddef.h>
    #include <stdint.h>

    void aligned_memory_copy(void* dst, const void* src, size_t bytes)
    {
      unsigned char* b_dst = (unsigned char*)dst;
      const unsigned char* b_src = (const unsigned char*)src;

      // Copy bytes until the source pointer is 4-byte aligned
      // (or until there is nothing left to copy)
      while (bytes > 0 && ((uintptr_t)b_src & 0x3) != 0)
      {
        *b_dst++ = *b_src++;
        bytes--;
      }

      // Copy a word at a time. Only the source is known to be aligned here,
      // so this assumes the platform tolerates unaligned word stores.
      uint32_t* w_dst = (uint32_t*)b_dst;
      const uint32_t* w_src = (const uint32_t*)b_src;
      while (bytes >= 4)
      {
        *w_dst++ = *w_src++;
        bytes -= 4;
      }

      // Copy trailing bytes
      b_dst = (unsigned char*)w_dst;
      b_src = (const unsigned char*)w_src;
      while (bytes > 0)
      {
        *b_dst++ = *b_src++;
        bytes--;
      }
    }

Different architectures will perform differently depending on whether the source or the destination pointer is the one that is aligned. For instance, on an XScale processor I got better performance by aligning the destination pointer rather than the source pointer.

To further improve performance, some loop unrolling can be done so that more of the processor's registers are loaded with data. The load/store instructions can then be interleaved and have their latency hidden by other instructions (such as loop counting). The benefit this brings varies quite a bit by processor, since load/store instruction latencies can be quite different.

At this stage the code ends up being written in assembly rather than C (or C++), since you need to place the load and store instructions manually to get the maximum benefit from latency hiding and throughput.

Generally a whole cache line of data should be copied in one iteration of the unrolled loop.
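To make the unrolling idea concrete, here is a small C sketch. It is only an illustration: the name `unrolled_word_copy` is hypothetical, it unrolls by just four 32-bit words (16 bytes) rather than a full cache line, and in C the compiler is still free to reorder the loads and stores, which is why tuned versions end up in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `words` 32-bit words, four per iteration. */
void unrolled_word_copy(uint32_t *dst, const uint32_t *src, size_t words)
{
    /* Unrolled body: all four loads are written before any store, so no
       load depends on a store and they can be overlapped. */
    while (words >= 4) {
        uint32_t w0 = src[0];
        uint32_t w1 = src[1];
        uint32_t w2 = src[2];
        uint32_t w3 = src[3];
        dst[0] = w0;
        dst[1] = w1;
        dst[2] = w2;
        dst[3] = w3;
        src += 4;
        dst += 4;
        words -= 4;
    }

    /* Remainder: at most three words are left. */
    while (words > 0) {
        *dst++ = *src++;
        words--;
    }
}
```

Unrolling further (for example 16 words on a CPU with 64-byte cache lines) follows the same pattern, limited by how many registers are available.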

Which brings me to the next improvement: adding prefetching. Prefetches are special instructions that tell the processor's cache system to load specific parts of memory into its cache. Since there is a delay between issuing the instruction and having the cache line filled, the instructions need to be placed so that the data becomes available just as it is about to be copied, and no sooner or later.

This means putting prefetch instructions at the start of the function as well as inside the main copy loop, with the prefetch instructions in the loop fetching data that will be copied several iterations later.

I can't remember, but it may also be beneficial to prefetch the destination addresses as well as the source ones.
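Here is a hedged sketch of where the prefetches would go, written in C with the `__builtin_prefetch` builtin that GCC and Clang provide (other compilers have different spellings). The function name and the `CHUNK` and `PREFETCH_AHEAD` values are assumptions for illustration only; the right lookahead distance depends on the processor and has to be measured.

```c
#include <stddef.h>

#define CHUNK 64            /* assumed cache line size in bytes */
#define PREFETCH_AHEAD 256  /* assumed lookahead distance in bytes */

/* Hypothetical helper: byte copy with source prefetching (GCC/Clang only). */
void prefetching_memory_copy(void *dst, const void *src, size_t bytes)
{
    unsigned char *b_dst = (unsigned char *)dst;
    const unsigned char *b_src = (const unsigned char *)src;

    /* Prefetch at the start of the function: request the first line. */
    __builtin_prefetch(b_src, 0, 0);

    while (bytes >= CHUNK) {
        /* Prefetch inside the loop: request data that will be copied a few
           iterations from now, but only while that address is still inside
           the source buffer. */
        if (bytes > PREFETCH_AHEAD)
            __builtin_prefetch(b_src + PREFETCH_AHEAD, 0, 0);

        for (size_t i = 0; i < CHUNK; ++i)
            b_dst[i] = b_src[i];

        b_src += CHUNK;
        b_dst += CHUNK;
        bytes -= CHUNK;
    }

    /* Tail: fewer than CHUNK bytes remain. */
    while (bytes > 0) {
        *b_dst++ = *b_src++;
        bytes--;
    }
}
```

The second argument (`0`) marks the prefetch as a read and the third (`0`) hints that the data has little temporal locality. Prefetching the destination would use the same builtin with `1` as the second argument.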

Factors

The main factors that affect how fast memory can be copied are:

The latency between the processor, its caches, and main memory.
The size and structure of the processor's cache lines.
The processor's memory move/copy instructions (latency, throughput, register size, etc).

So if you want to write an efficient and fast memory copy routine, you'll need to know quite a lot about the processor and architecture you are writing for. Suffice it to say, unless you're writing for some embedded platform, it is much easier to just use the built-in memory copy routines.

memcpy can copy more than one byte at once depending on the computer's architecture. Most modern computers can work with 32 bits or more in a single processor instruction.

From one example implementation:

     * For speedy copying, optimize the common case where both pointers
     * and the length are word-aligned, and copy word-at-a-time instead
     * of byte-at-a-time. Otherwise, copy by bytes.
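The strategy described in that comment can be sketched roughly as follows. This is not the code from the quoted implementation; the name `fast_path_memory_copy` is made up, `long` stands in for the machine word, and `sizeof(long)` is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: word copy when everything is aligned, else bytes. */
void *fast_path_memory_copy(void *dst, const void *src, size_t len)
{
    const size_t mask = sizeof(long) - 1;  /* assumes a power-of-two size */

    if ((((uintptr_t)dst | (uintptr_t)src | len) & mask) == 0) {
        /* Fast path: both pointers and the length are word-aligned.
           Note: reading arbitrary objects through long* is only strictly
           valid C when they really are longs; libraries rely on
           implementation-specific guarantees here. */
        long *d = (long *)dst;
        const long *s = (const long *)src;
        size_t words = len / sizeof(long);
        for (size_t i = 0; i < words; ++i)
            d[i] = s[i];
    } else {
        /* Slow path: copy one byte at a time. */
        unsigned char *d = (unsigned char *)dst;
        const unsigned char *s = (const unsigned char *)src;
        for (size_t i = 0; i < len; ++i)
            d[i] = s[i];
    }
    return dst;
}
```

OR-ing the two addresses and the length together means a single mask test covers all three alignment conditions.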
