What's the meaning of thread concurrency overhead time in the profiler output?

multithreading



I am also not much of an expert on that, though I have tried to use pthread a bit myself.

To demonstrate my understanding of overhead time, let us take the example of a simple single-threaded program to compute an array sum:

    for (i = 0; i < NUM; i++) {
        sum += array[i];
    }

In a simple, reasonably done multi-threaded version of that code, the array could be broken into one piece per thread, each thread keeps its own sum, and after the threads are done, the partial sums are combined.

In a very poorly written multi-threaded version, the array could be broken down as before, but every thread could atomicAdd each of its elements to a single global sum.

In this case, only one thread can perform the atomic addition at a time. I believe overhead time is a measure of how long all of the other threads spend waiting to do their own atomicAdd (you could write this program yourself to check, if you want to be sure).

Of course, it also accounts for the time spent acquiring and releasing the semaphores and mutexes themselves. In your case, it probably means a significant amount of time is spent inside the internals of mutex.lock and mutex.unlock.

I parallelized a piece of software a while ago (using pthread_barrier) and ran into the problem that running the barriers took longer than just using one thread. It turned out that the loop, which needed 4 barriers per iteration, executed quickly enough that the synchronization overhead wasn't worth it.


Sorry, I'm not an expert on pthread or Intel VTune Amplifier, but yes, locking a mutex and unlocking it will probably count as overhead time.

Locking and unlocking mutexes can be implemented as system calls, which the profiler would probably just lump under threading overhead.


I'm not familiar with VTune, but there is overhead in the OS when switching between threads. Each time a thread stops and another is loaded onto a processor, the current thread's context needs to be stored so that it can be restored when the thread next runs, and then the new thread's context needs to be restored so it can carry on processing.

The problem may be that you have too many threads, so the processor spends most of its time switching between them. Multi-threaded applications generally run most efficiently when the number of threads matches the number of processors.