Parallel version of loop not faster than serial version


Perform the computation on that particle, storing the result in a separate array

How heavy are the computations?

  • Generally speaking, an atomic counter can cost hundreds of clock cycles, so it is important to check that your threads do real work and are not mostly just incrementing counters.
  • Also check how much work each thread does and whether they cooperate well (i.e. on each cycle each processes about half of the particles).
  • Try subdividing the job into chunks bigger than a single particle (say, 100 particles).
  • See how much work is done outside of the threads.
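The chunking suggestion can be sketched roughly as follows. This is only an illustration, not the asker's code: `compute_particle`, `process_chunked`, and the use of `double` for a particle are all assumptions made for the example. Each worker claims a block of particles with one atomic increment, so the cost of the atomic operation is spread over the whole chunk.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-particle computation in the question.
double compute_particle(double particle) { return particle * 2.0; }

// Processes `particles` with `num_threads` workers. Each worker claims
// `chunk` consecutive particles per atomic increment instead of one.
std::vector<double> process_chunked(const std::vector<double>& particles,
                                    unsigned num_threads,
                                    std::size_t chunk = 100) {
    assert(num_threads > 0 && chunk > 0);
    std::vector<double> results(particles.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed particle

    auto worker = [&] {
        for (;;) {
            // One atomic operation pays for up to `chunk` particles.
            const std::size_t begin =
                next.fetch_add(chunk, std::memory_order_relaxed);
            if (begin >= particles.size()) break;
            const std::size_t end = std::min(begin + chunk, particles.size());
            // Each index is written by exactly one thread, so no lock is needed.
            for (std::size_t i = begin; i < end; ++i)
                results[i] = compute_particle(particles[i]);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();  // join makes all results visible
    return results;
}
```

Calling `process_chunked(particles, 4)` would run four workers with the default chunk of 100. Which chunk size works best depends on how heavy the per-particle computation is, so it is worth measuring a few values.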

Honestly... it looks like what you are describing is a bug.


profiling has not revealed much

This is unclear. I have experience profiling a multithreaded application on HP-UX, where the profiler reports the percentage of time spent in each function. So if you have one or a few contention points in your functions, you see an increase in the time your application spends in those functions. In my case I saw a significant increase in pthread_mutex_unlock(). When I changed my code, it became much faster.

So could you post the same statistics for one thread and for two/four threads, along with the number of computations in each test?

Also, I recommend (if possible) setting a breakpoint on the global function that locks a mutex. You might find that somewhere in your algorithm you inadvertently lock a global mutex.


Your language is kind of revealing:

Wait on xxx

This might be your problem.


Plus, you slow down again when every thread adds to a single result queue: if possible, add the results to the single queue only at the end of processing. The main thread should not wait, but check the global counter after every update.
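A minimal sketch of collecting results per thread and merging them only once at the end might look like this. The names `simulate_particle` and `process_with_local_buffers` are hypothetical, and a static split of the particle range is assumed for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-particle computation.
double simulate_particle(double particle) { return particle + 1.0; }

std::vector<double> process_with_local_buffers(
        const std::vector<double>& particles, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = particles.size();
    // One private buffer per thread: no lock is taken while computing.
    std::vector<std::vector<double>> local(num_threads);

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            // Static split: thread t handles the contiguous range [begin, end).
            const std::size_t begin = n * t / num_threads;
            const std::size_t end = n * (t + 1) / num_threads;
            local[t].reserve(end - begin);
            for (std::size_t i = begin; i < end; ++i)
                local[t].push_back(simulate_particle(particles[i]));
        });
    }
    for (auto& th : threads) th.join();

    // Single merge step after all workers have finished.
    std::vector<double> merged;
    merged.reserve(n);
    for (const auto& buf : local)
        merged.insert(merged.end(), buf.begin(), buf.end());
    return merged;
}
```

Because the ranges are contiguous and merged in thread order, the output keeps the input order. The shared container is touched only after the join, so no mutex is involved at all.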
Instead of profiling, I would add performance counters that you log at the end. You can put them inside a conditional compilation block so that they are not included in your production code.
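One possible shape for such counters is sketched below. The macro name `PERF_COUNTERS`, the `PerfCounters` struct, and `run_instrumented` are all made up for this example, and the summation loop merely stands in for the real work. Each thread counts into a local variable and publishes it once, so the counters themselves add no contention.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// PERF_COUNTERS is a hypothetical macro; pass -DPERF_COUNTERS in test builds.
#ifdef PERF_COUNTERS
struct PerfCounters {
    std::uint64_t particles = 0;  // particles processed by one thread
};
#define PERF_ADD(counters, field, n) ((counters).field += (n))
#else
struct PerfCounters {};  // empty in production builds
#define PERF_ADD(counters, field, n) ((void)0)
#endif

// Sums 0..n-1 across threads; the sum stands in for the real computation.
std::uint64_t run_instrumented(std::uint64_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<PerfCounters> perf(num_threads);
    std::vector<std::uint64_t> partial(num_threads, 0);

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            const std::uint64_t begin = n * t / num_threads;
            const std::uint64_t end = n * (t + 1) / num_threads;
            PerfCounters local;  // thread-private, written without any locking
            std::uint64_t sum = 0;
            for (std::uint64_t i = begin; i < end; ++i) {
                sum += i;
                PERF_ADD(local, particles, 1);
            }
            partial[t] = sum;
            perf[t] = local;  // publish once, when the thread is done
        });
    }
    for (auto& th : threads) th.join();

#ifdef PERF_COUNTERS
    // Log the counters once, at the end of the run.
    for (unsigned t = 0; t < num_threads; ++t)
        std::fprintf(stderr, "thread %u: %llu particles\n", t,
                     static_cast<unsigned long long>(perf[t].particles));
#endif

    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

Without `-DPERF_COUNTERS` the macro expands to nothing and the struct is empty, so the production build carries no counting code. Comparing the per-thread numbers from one-, two-, and four-thread runs shows quickly whether the work is split evenly.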