C++ massive performance loss because of if statement


You need to run a profiler to get to the bottom of this. On Linux, use perf.

My guess is that EnergyFunction::evaluate() is being optimized away entirely: in the first examples you never use the result, so the compiler can discard the whole call. Try writing the return value to a volatile variable, which should force the compiler or linker to keep the call. A 1000x difference is definitely not attributable to a simple comparison.
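A minimal sketch of the volatile trick, assuming a stand-in evaluate() since the real EnergyFunction::evaluate() is not shown in the question. The names sink and run_benchmark are hypothetical. Storing into a volatile object is an observable side effect, so the store cannot be dropped; note that the compiler may still hoist a pure computation out of the loop if its input never changes, so this works best when the input differs per iteration, as it does in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for EnergyFunction::evaluate(); the real one is not shown.
double evaluate(const std::vector<int>& sequence) {
    double energy = 0.0;
    for (std::size_t i = 0; i + 1 < sequence.size(); ++i) {
        energy += sequence[i] * sequence[i + 1];
    }
    return energy;
}

// Every write to a volatile object must actually be performed,
// so the result of evaluate() counts as "used".
volatile double sink = 0.0;

// Calls evaluate() `iterations` times and returns the number of calls made.
long run_benchmark(const std::vector<int>& sequence, long iterations) {
    assert(iterations >= 0);
    long calls = 0;
    for (long i = 0; i < iterations; ++i) {
        sink = evaluate(sequence);  // the store keeps the result alive
        ++calls;
    }
    return calls;
}
```

If the timing with the volatile store matches the slow version, the fast version was most likely never calling evaluate() at all.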


There is actually an atomic instruction to increment an int by 1, so a smart compiler may be able to remove the mutex entirely, although I'd be surprised if it did. You can test this by looking at the assembly, or by removing the mutex, changing the type of overallGeneration to atomic<int>, and checking how fast it still is. This optimization is no longer possible with your last, slow example.
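The experiment with atomic<int> could look roughly like this. It is a sketch under assumptions: overallGeneration is taken from the question, while countingWorker and run_workers are hypothetical helpers, and the real per-iteration work is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Same name as in the question, but atomic instead of mutex-protected.
std::atomic<int> overallGeneration{0};

// Hypothetical worker: only the counter update is shown.
void countingWorker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        // Atomic increment, no mutex needed for the counter itself.
        overallGeneration.fetch_add(1, std::memory_order_relaxed);
    }
}

// Starts the workers, waits for them, and returns the final count.
int run_workers(int threadCount, int iterationsPerThread) {
    assert(threadCount > 0);
    overallGeneration.store(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back(countingWorker, iterationsPerThread);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return overallGeneration.load();
}
```

If this runs about as fast as the original mutex version, the mutex was probably not the bottleneck in the fast case.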

Also, if the compiler can see that evaluate does not modify any global state and the result isn't used, it can skip the entire call to evaluate. You can find out whether that's the case by looking at the assembly, or by removing the call to EnergyFunction::evaluate(sequence) and looking at the timing: if it doesn't speed up, the function wasn't being called in the first place. This optimization is no longer possible with your last, slow example. You should be able to force the compiler to execute EnergyFunction::evaluate(sequence) by defining the function in a different object file (another cpp file or a library) and disabling link-time optimization.

There are other effects here that also create a performance difference, but I can't see any that would explain a factor of 1000. A factor of 1000 usually means the compiler cheated in the previous test and the change now prevents it from cheating.


I am not sure my answer explains such a dramatic performance drop, but the effects below may well contribute to it.

In the first case you added branches to the non-critical area:

    if (dist(mt) == 0) {
        sequence[distDim(mt)] = -1;
    } else {
        sequence[distDim(mt)] = 1;
    }

In this case the CPU (at least on Intel architecture) will perform branch prediction, and every branch misprediction carries a performance penalty. Since the condition here is random, the branch is inherently hard to predict.
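For illustration, the same update can be written without a data-dependent branch by computing the sign arithmetically. This is only a sketch: dist, distDim, mt and sequence follow the question's snippet, but their types are assumptions, and mutate is a hypothetical wrapper. An optimizing compiler may already turn the original if/else into a conditional move, so measure before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sets `flips` randomly chosen entries of `sequence` to -1 or +1
// and returns the number of writes performed.
int mutate(std::vector<int>& sequence, std::mt19937& mt, int flips) {
    assert(!sequence.empty());
    // Assumed distributions: dist yields 0 or 1, distDim a valid index.
    std::uniform_int_distribution<int> dist(0, 1);
    std::uniform_int_distribution<std::size_t> distDim(0, sequence.size() - 1);
    for (int i = 0; i < flips; ++i) {
        // Maps 0 to -1 and 1 to +1 without an if/else on random data.
        const int sign = 2 * dist(mt) - 1;
        sequence[distDim(mt)] = sign;
    }
    return flips;
}
```

The random draws happen in the same order as in the original snippet (first dist, then distDim), so the generated sequence is unchanged for a given seed.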

Now regarding the second addition, you added a branch to the critical area:

    mainMTX.lock();
    if (fitness < overallFitness)
        overallFitness = fitness;
    overallGeneration++;
    mainMTX.unlock();

In addition to the misprediction penalty, this increases the amount of code executed inside the critical area, and with it the probability that other threads will have to wait for mainMTX.unlock().
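One way to reduce that waiting, shown here as a swapped-in technique and not as the code from the question, is to keep a per-thread best value and counter and merge them under the lock once per thread. The names mainMTX, overallFitness and overallGeneration come from the question; searchWorker, run_search and the precomputed fitness values are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mainMTX;
double overallFitness = std::numeric_limits<double>::infinity();
long overallGeneration = 0;

// Hypothetical worker: tracks its own best and count, then merges once.
void searchWorker(const std::vector<double>& fitnessValues) {
    double localBest = std::numeric_limits<double>::infinity();
    long localGenerations = 0;
    for (double fitness : fitnessValues) {
        if (fitness < localBest) localBest = fitness;  // no lock needed here
        ++localGenerations;
    }
    // The critical section now runs once per thread, not once per iteration.
    std::lock_guard<std::mutex> lock(mainMTX);
    if (localBest < overallFitness) overallFitness = localBest;
    overallGeneration += localGenerations;
}

// Runs one worker per input vector and returns the total generation count.
long run_search(const std::vector<std::vector<double>>& inputs) {
    assert(!inputs.empty());
    overallFitness = std::numeric_limits<double>::infinity();
    overallGeneration = 0;
    std::vector<std::thread> threads;
    for (const auto& values : inputs) {
        threads.emplace_back(searchWorker, std::cref(values));
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return overallGeneration;
}
```

The trade-off is that other threads see the global best and the generation count only after a worker finishes, which may or may not be acceptable for the algorithm in the question.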

NOTE

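As a diagnostic, try declaring all the global/shared variables volatile. Otherwise the compiler may optimize accesses to them away (which might explain such a high number of evaluations at the very beginning). Note that volatile only keeps the compiler from dropping the accesses; it does not make them thread-safe, so the mutex (or std::atomic) is still needed.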

In the case of overallFitness, it probably won't be optimized out because it is declared extern, but overallGeneration may be. If so, that may explain the performance drop after adding a "real" memory access in the critical area.

NOTE2

I am still not sure that my explanation accounts for such a significant performance drop, so I believe there may be implementation details in the code you didn't post (whether the shared variables are volatile, for example).

EDIT

As Peter (@Peter) and Mark Lakata (@MarkLakata) stated in their separate answers, and I tend to agree with them, the most likely reason for the performance drop is that in the first case fitness was never used, so the compiler optimized that variable out together with the function call. In the second case fitness was used, so the compiler couldn't optimize it away. Good catch, Peter and Mark! I just missed that point.