
May compiler optimizations be inhibited by multi-threading?


Apart from the explicit OpenMP pragmas, compilers simply don't know whether code will be executed by multiple threads. So they can make that code neither more nor less efficient.

This has grave consequences in C++. It is particularly a problem for library authors: they cannot reasonably guess up front whether their code will be used in a program that uses threading. This is very visible when you read the source of a common C runtime or standard C++ library implementation. Such code tends to be peppered with little locks all over the place to ensure it still operates correctly when it is used from threads. You pay for this even if you don't actually use that code in a threaded way. A good example is std::shared_ptr<>: you pay for the atomic update of the reference count even if the smart pointer is only ever used in one thread. And the standard doesn't provide a way to ask for non-atomic updates; a proposal to add the feature was rejected.
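To make that concrete, here is a minimal C++ sketch of the cost described above (my own illustration; the function name use_copy is made up). In typical implementations, copying a std::shared_ptr bumps the reference count with an atomic operation so the count stays correct across threads, and you pay that price even when the pointer never leaves one thread:

#include <memory>

void use_copy(std::shared_ptr<int> p)   // takes the pointer by value: one copy
{
    /* ... */
}

int main()
{
    auto p = std::make_shared<int>(42);
    use_copy(p);   // copy: typically an atomic increment of the reference count
                   // on entry and an atomic decrement when the copy is destroyed,
                   // even though only one thread ever touches p
}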

And it is very much detrimental the other way as well: there isn't anything the compiler can do to ensure your own code is thread-safe. It is entirely up to you to make it thread-safe. That is hard to do, and it goes wrong in subtle and very difficult-to-diagnose ways all the time.
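As a minimal sketch of the kind of bug meant here (my own example, not part of the answer above): the following compiles cleanly, yet it contains a data race the compiler does nothing about, and the printed value is unpredictable from run to run:

#include <iostream>
#include <thread>

int counter = 0;                       // shared and unsynchronized

void bump()
{
    for (int i = 0; i < 100000; ++i)
        ++counter;                     // data race: read-modify-write is not atomic
}

int main()
{
    std::thread t1(bump), t2(bump);
    t1.join();
    t2.join();
    std::cout << counter << '\n';      // rarely the expected 200000
}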

Big problems, not simple to solve. Maybe that's a good thing, otherwise everybody could be a programmer ;)


I think this answer describes the reason sufficiently, but I'll expand a bit here.

Before that, however, here's gcc 4.8's documentation on -fopenmp:

-fopenmp
Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.

Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.

However, the reason OpenMP with 1 thread has overhead compared with no OpenMP at all is that the compiler needs to transform the code, outlining the parallel region into a separate function so that it is ready for the case of n>1 threads. So let's think of a simple example:

int *b = ...;
int *c = ...;
int a = 0;

#pragma omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];

This code should be converted to something like this:

struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...

for (t = 1; t < nthreads; ++t)
    /* create_thread with __omp_func1 function */

/* for the master thread, don't create a thread */
struct __omp_func1_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;

for (t = 1; t < nthreads; ++t)
{
    /* join with thread */
    /* add thread_data->a to a */
}

Now if we run this with nthreads==1, the code effectively gets reduced to:

struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...

struct __omp_func1_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;

So what are the differences between the no openmp version and the single threaded openmp version?

One difference is that there is extra glue code. The variables that need to be passed to the function created by OpenMP have to be put together into one argument, so there is some overhead in preparing for the function call (and in retrieving the data afterwards).

More importantly, however, the code is no longer in one piece. Cross-function optimization is not yet very advanced, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize.
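As a hedged illustration of that point (my own sketch, not compiler output): with the loop left in place, the optimizer can see that scale is a compile-time constant and fold it into the loop body; once the body is hidden behind an outlined function that reaches everything through a data block (as __omp_func1 does above), the same information is only recoverable through interprocedural analysis:

// Inlined form: everything the optimizer needs is visible in one function.
int sum_inline(const int *b, const int *c, int n)
{
    const int scale = 2;                  // visible constant, easily folded
    int a = 0;
    for (int i = 0; i < n; ++i)
        a += scale * (b[i] + c[i]);
    return a;
}

// Outlined form: the constant is now runtime data reached through a pointer.
struct loop_data { const int *b; const int *c; int n; int scale; int a; };

void outlined_body(struct loop_data *d)
{
    d->a = 0;
    for (int i = 0; i < d->n; ++i)
        d->a += d->scale * (d->b[i] + d->c[i]);   // scale is opaque here
}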


To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3)

Running gcc -Q -v some_file.c gives this (relevant) output:

GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs

and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:

GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486 -fopenmp -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident -finline-functions-called-once -fira-share-save-slots -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-debug-strings -fmove-loop-invariants -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops= -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double -maccumulate-outgoing-args -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4 -mpush-args -msahf -mtls-direct-seg-refs

Taking a diff, we can see that the only difference is that with -fopenmp, -D_REENTRANT is defined (and of course -fopenmp is enabled). So, rest assured, gcc wouldn't produce worse code. It's just that it needs to add preparation code for the case where the number of threads is greater than 1, and that has some overhead.


Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the output of the same commands with -O3 added shows the same single difference. So, even with -O3, no optimizations are disabled.


That's a good question, even if it's rather broad, and I'm looking forward to hearing from the experts. I think @JimCownie had a good comment about this in the following discussion: Reasons for omp_set_num_threads(1) slower than no openmp.

Auto-vectorization and auto-parallelization, I think, are often a problem. If you turn on auto-parallelization in MSVC 2012 (auto-vectorization is on by default), they seem not to mix well together. Using OpenMP seems to disable MSVC's auto-vectorization. The same may be true for GCC with OpenMP and auto-vectorization, but I'm not sure.

I don't really trust auto-vectorization in the compiler anyway. One reason is that I'm not sure it does loop unrolling to eliminate loop-carried dependencies as well as scalar code does. For this reason I try to do these things myself: I do the vectorization myself (using Agner Fog's vector class) and I unroll the loops myself, as in the sketch below. By doing this by hand I feel more confident that I maximize all the parallelism: TLP (e.g. with OpenMP), ILP (e.g. by removing data dependencies with loop unrolling), and SIMD (with explicit SSE/AVX code).
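Here is a minimal sketch of the unrolling idea described above, in plain scalar C++ rather than Agner Fog's vector class (the function name is mine): a reduction carries a dependency through its single accumulator, so splitting the sum across several independent accumulators breaks that chain and lets the CPU (or a vectorizer) work on them in parallel, combining them once at the end:

float sum_unrolled(const float *x, int n)
{
    // four independent accumulators -> four independent dependency chains
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        a0 += x[i + 0];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; ++i)                 // remainder loop
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);      // combine the partial sums once
}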