
Code runs 6 times slower with 2 threads than with 1


Why are 2 threads 6x slower than 1 thread?

You are getting hit by a bad case of false sharing.

After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?

You are bottlenecked by your memory bandwidth.


False Sharing:

The problem here is that each thread is accessing its result variable, and the two results sit at adjacent memory locations. It's likely that they fall on the same cacheline, so each time a thread writes to its result, it bounces the cacheline between the cores.

Each thread is running this loop:

    for (uint32_t i = 0; i < length; i++)
    {
        *result += (*datavec).at(start + i);
    }

And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.

Normally, the compiler would put *result into a register, thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is still accessing memory on every iteration and thus incurring the false-sharing penalty each time through the loop.
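As an illustration of the usual fix, here is a minimal sketch (not the asker's exact code; the function name and the pointer-based interface are assumptions modelled on the loop above) that accumulates into a local variable and writes to *result only once:

    #include <cstddef>
    #include <vector>

    // Accumulate into a local that can live in a register; the shared location
    // behind 'result' is written exactly once, so the cacheline it shares with
    // the other thread's result is no longer fought over inside the hot loop.
    void partial_sum(const std::vector<double>* datavec, std::size_t start,
                     std::size_t length, double* result)
    {
        double local = 0.0;
        for (std::size_t i = 0; i < length; ++i)
            local += (*datavec)[start + i];   // read-only traffic on the shared data
        *result = local;                      // single write to the shared result slot
    }

With optimizations enabled the compiler will usually perform this transformation for you, but writing it explicitly keeps the code robust even at -O0.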

Memory Bandwidth:

Once you have eliminated the false sharing and gotten rid of the 6x slowdown, the reason you're not getting any further improvement is that you've maxed out your memory bandwidth.

Sure, your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work per memory access. A single thread is already enough to max out your memory bandwidth, so going to more threads is not likely to get you much improvement.
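As a rough back-of-the-envelope illustration (the numbers are assumptions, not measurements of your machine): summing doubles loads 8 bytes per addition, so even a generous ~10 GB/s of sustained main-memory bandwidth feeds only ~1.25 billion additions per second, which a single core clocked at a few GHz can retire with ease. A second core mostly ends up waiting on the same memory bus.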

In short, no, you won't be able to make summing an array significantly faster by throwing more threads at it.


As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as std::vector<T>::operator[]()) accesses the length of the vector on each element access. To avoid this, you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.

Here are the relevant parts of the code:

    thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
    thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));

and

    void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
    {
        *result = std::accumulate(start, end, 0.0);
    }

This consistently gives me better performance for two threads on my 32-bit netbook.
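For reference, a minimal self-contained sketch of how those pieces fit together might look like this; the vector size, fill value, and hard-coded two-way split are arbitrary assumptions for illustration:

    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Each worker writes its partial sum exactly once, via std::accumulate,
    // so the adjacent result slots are not contended inside the hot loop.
    void findmean(std::vector<double>::const_iterator start,
                  std::vector<double>::const_iterator end,
                  double* result)
    {
        *result = std::accumulate(start, end, 0.0);
    }

    int main()
    {
        std::vector<double> data(20000000, 1.0);   // size and value are assumptions
        double result[2] = {0.0, 0.0};

        auto mid = std::begin(data) + data.size() / 2;
        std::thread t0(findmean, std::begin(data), mid, &result[0]);
        std::thread t1(findmean, mid, std::end(data), &result[1]);
        t0.join();
        t1.join();

        std::cout << "mean = " << (result[0] + result[1]) / data.size() << '\n';
    }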


More threads don't mean faster code! There is overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For trivial work like this, a single thread is probably better.
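To put a number on the thread-creation overhead, here is a tiny, machine-dependent sketch (whatever it prints depends on your hardware and OS; the figure in the comment is only a ballpark assumption):

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main()
    {
        // Measure the cost of spawning and joining a thread that does nothing.
        // On typical desktop hardware this is on the order of tens of
        // microseconds, which can dwarf a small amount of actual work.
        auto t0 = std::chrono::steady_clock::now();
        std::thread t([]{});
        t.join();
        auto t1 = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us to create and join an empty thread\n";
    }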