How to increase performance of memcpy


I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

#include <windows.h>
#include <string.h>

#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};

typedef struct
{
    int ct;              //index of the worker thread
    void * src, * dest;  //this worker's slice of the copy
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

//worker: wait for a start signal, copy the assigned slice, signal completion
DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

//create the semaphores and worker threads once, before the first mt_memcpy
int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL);
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

//only call this when no copy is in progress, since the workers are terminated forcibly
int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyThreads[ctr]);  //release the thread handle as well
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }

    return 0;
}
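
For anyone adapting this, the splitting arithmetic in mt_memcpy can be looked at on its own. The sketch below is portable C and repeats the same start/end computation. The names chunk_t and chunk_for are made up for illustration and are not part of the code above. Because each end is computed as (index + 1) * bytes / count, the slices are contiguous and always add up to the full size, even when the size is not divisible by the thread count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one worker's share of the copy. */
typedef struct {
    size_t offset;  /* byte offset of the slice from the start of the buffer */
    size_t size;    /* number of bytes in the slice */
} chunk_t;

/* Returns slice `index` out of `count` for a copy of `bytes` bytes,
 * using the same arithmetic as mt_memcpy.
 * Assumption: index * bytes does not overflow size_t, which holds for
 * ordinary buffer sizes and small thread counts. */
static chunk_t chunk_for(size_t bytes, size_t index, size_t count)
{
    assert(count > 0 && index < count);

    size_t begin = index * bytes / count;
    size_t end = (index + 1) * bytes / count;

    chunk_t c = { begin, end - begin };
    return c;
}
```

Worker i would then copy chunk_for(bytes, i, NUM_CPY_THREADS).size bytes starting at the returned offset in both buffers.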


I'm not sure whether it's done at run time or whether you have to do it at compile time, but you should have SSE or similar extensions enabled, as the vector unit can often write 128 bits to memory at a time, compared to 64 bits for the CPU's general-purpose registers.

Try this implementation.

Yeah, and make sure that both the source and destination are aligned to 128 bits (16 bytes). If your source and destination are not aligned with respect to each other, your memcpy() will have to do some serious magic. :)
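
To check this before timing anything, something like the following could be used. It is only a sketch: is_aligned and same_misalignment are names I made up, and the alignment is assumed to be a power of two (16 for 128-bit SSE loads and stores). The second function tests whether the two pointers share the same offset within a 16-byte block, which is what "aligned with respect to each other" means here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of `alignment` (must be a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Nonzero if dst and src have the same offset within an alignment block.
 * When this holds, a copy routine can align both pointers at once by
 * copying a few leading bytes; when it does not, one side is always
 * misaligned for wide loads or stores. */
static int same_misalignment(const void *dst, const void *src, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (((uintptr_t)dst ^ (uintptr_t)src) & (alignment - 1)) == 0;
}
```

For example, is_aligned(dest, 16) && is_aligned(src, 16) is the ideal case, while same_misalignment(dest, src, 16) alone is the next best.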


One thing to be aware of is that your process (and hence the performance of memcpy()) is affected by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, and it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application, though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful about this, because it can have a really negative impact on other processes (and on the user experience on the machine).

Another thing to keep in mind is that the OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory to memcpy() is likely not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that your destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've had a need to do anything like this).
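
One portable way to take page faults out of the measurement, whatever API the memory came from, is to touch every page of the destination once before the timed copies. This is a rough sketch rather than anything Windows-specific: prefault is a made-up name, and the page size is passed in by the caller (4096 is common, but that is an assumption you should check on your system). It overwrites one byte per page, so only use it on a buffer whose contents are about to be replaced anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Writes one byte into every page of [buf, buf + size) so that the OS
 * has to back each page with physical memory before the real copy runs.
 * `page_size` is supplied by the caller (assumption: commonly 4096).
 * The volatile pointer keeps the compiler from dropping the writes. */
static void prefault(void *buf, size_t size, size_t page_size)
{
    assert(page_size > 0);

    volatile unsigned char *p = (volatile unsigned char *)buf;

    for (size_t off = 0; off < size; off += page_size)
        p[off] = 0;

    /* The buffer may not start on a page boundary, so the last page
     * can be missed by the strided loop; touch the final byte too. */
    if (size > 0)
        p[size - 1] = 0;
}
```

Calling prefault(dest, bytes, 4096) once after allocation, outside the timed loop, should mean the first memcpy() is no longer paying for the faults.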

Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.