OpenMP: sharing arrays between threads


Congratulations! You have exposed yet another bad OpenMP implementation, courtesy of Microsoft. My initial theory was that the problem comes from the partitioned L3 cache in Sandy Bridge and later Intel CPUs. But the result from running the second loop only on the first half of the vector did not confirm that theory. Then it has to be something in the code generator that is triggered when OpenMP is enabled. The assembly output confirms this.

Basically, the compiler does not optimise the serial loop when compiling with OpenMP enabled. That's where the slowdown comes from. You also introduced part of the problem yourself by making the second loop not identical to the first one. In the first loop you accumulate intermediate values into a temporary variable, which the compiler optimises into a register variable, while in the second loop you invoke operator[] on each iteration. When you compile without OpenMP enabled, the code optimiser transforms the second loop into something quite similar to the first loop, hence you get almost the same run time for both loops.
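To make the difference between the two loop forms concrete, here is a minimal sketch. The function names and the inner summation are hypothetical stand-ins, not the code from the question; only the pattern (temporary versus operator[] on every iteration) is the point.

```cpp
#include <cstddef>
#include <vector>

// Form 1: accumulate into a local temporary and store once per element.
// The temporary is a plain local, so the optimiser can keep it in a register.
void fill_with_temporary(std::vector<double>& vec, int inner) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        double tmp = 0.0;
        for (int j = 0; j < inner; ++j)
            tmp += j;
        vec[i] = tmp;
    }
}

// Form 2: go through operator[] on every inner iteration.
// Unless the optimiser hoists the element into a register, each step
// reads and writes memory.
void fill_through_subscript(std::vector<double>& vec, int inner) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i] = 0.0;
        for (int j = 0; j < inner; ++j)
            vec[i] += j;
    }
}
```

Both functions compute the same values; whether they also run at the same speed depends on what the optimiser does with the second form.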

When you enable OpenMP, the code optimiser does not optimise the second loop and it runs much slower. The fact that your code executes a parallel block before it has nothing to do with the slowdown. My guess is that the code optimiser is unable to grasp that the second loop lies outside the scope of the OpenMP parallel region, so vec1 should no longer be treated as a shared variable there and the loop can be optimised. Apparently this is a "feature" introduced in Visual Studio 2012, since the code generator in Visual Studio 2010 is able to optimise the second loop even with OpenMP enabled.

One possible solution would be to migrate to Visual Studio 2010. Another (hypothetical, since I don't have VS2012) solution would be to extract the second loop into a function and to pass the vector by reference to it. Hopefully the compiler would be smart enough to optimise the code in the separate function.
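A rough sketch of the second workaround might look like this. It is untested on VS2012, and the names serial_pass and parallel_then_serial as well as the loop bodies are made up for illustration. The loop index is a signed int because OpenMP 2.0 requires that for parallel loops.

```cpp
#include <vector>

// The serial loop lives in its own function and receives the vector by
// reference, so it is compiled apart from any OpenMP construct.
void serial_pass(std::vector<double>& vec, int inner) {
    const int n = static_cast<int>(vec.size());
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < inner; ++j)
            vec[i] += j;
}

void parallel_then_serial(std::vector<double>& vec, int inner) {
    const int n = static_cast<int>(vec.size());

    // Parallel part: vec is shared between the threads here.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double tmp = 0.0;
        for (int j = 0; j < inner; ++j)
            tmp += j;
        vec[i] = tmp;
    }

    // Serial part: delegated to the separate function.
    serial_pass(vec, inner);
}
```

If the compiler inlines serial_pass back into the caller, the effect may well be lost, so check the assembly output before relying on this.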

This is a very bad trend. Microsoft have practically given up on supporting OpenMP in Visual C++. Their implementation still (almost) conforms to OpenMP 2.0 only (hence no explicit tasks and other OpenMP 3.0+ goodies), and bugs like this one do not make things any better. I would recommend that you switch to another OpenMP-enabled compiler (Intel C/C++ Compiler, GCC, anything non-Microsoft) or to some other compiler-independent threading paradigm, for example Intel Threading Building Blocks. Microsoft is clearly pushing their parallel library for .NET, and that's where all the development goes.


Big Fat Warning

Do not use clock() to measure the elapsed wall-clock time! This only works as expected on Windows. On most Unix systems (including Linux) clock() actually returns the total CPU time consumed by all threads in the process since it was created. This means that clock() may return values that are either several times larger than the elapsed wall-clock time (if the program runs with many busy threads) or several times smaller than the wall-clock time (if the program sleeps or waits on I/O events between the measurements). In OpenMP programs, use the portable timer function omp_get_wtime() instead.
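A minimal timing helper along these lines could look as follows. The helper names wall_seconds, cpu_seconds and time_wall are hypothetical, and the std::chrono fallback is my own addition so that the snippet still builds when OpenMP is switched off.

```cpp
#include <chrono>
#include <ctime>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock time in seconds, measured from some arbitrary point in the past.
double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    // Fallback (an assumption, not part of OpenMP): a monotonic C++11 clock.
    using clock_type = std::chrono::steady_clock;
    return std::chrono::duration<double>(
        clock_type::now().time_since_epoch()).count();
#endif
}

// What clock() reports, converted to seconds. On most Unix systems this is
// CPU time summed over all threads, not elapsed time.
double cpu_seconds() {
    return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
}

// Run a callable and return the elapsed wall-clock time in seconds.
template <typename Work>
double time_wall(Work&& work) {
    const double start = wall_seconds();
    work();
    return wall_seconds() - start;
}
```

Wrapping each loop in time_wall and printing the difference in cpu_seconds() next to it should make the discrepancy visible on Linux as soon as more than one thread is busy.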