Why doesn't this code scale linearly?


Sparse matrix-vector multiplication is memory bound, as a simple roofline model would show. Memory-bound problems benefit from the higher memory bandwidth of multi-socket NUMA systems, but only if the data initialisation is done in such a way that the data is distributed among the NUMA domains. I have reason to believe that you are loading the matrix serially, and therefore all of its memory is allocated on a single NUMA node. In that case you won't benefit from the doubled memory bandwidth of a dual-socket system, and it really doesn't matter whether you use schedule(dynamic) or schedule(static).

What you could do is enable the interleaved NUMA memory policy so that the allocation is spread across both NUMA nodes. Each thread would then end up with 50% local and 50% remote memory accesses, instead of all threads on the second CPU being hit by 100% remote accesses. The easiest way to enable the policy is with numactl:

$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...

OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
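
An alternative to interleaving is to exploit the first-touch page placement policy: a page is physically allocated on the NUMA node of the thread that first writes to it, so if each thread initialises the data it will later work on, its working set ends up in local memory. Here is a minimal sketch of the idea, assuming a CSR matrix with hypothetical array names (val, colidx, rowptr):

/* First-touch initialisation sketch: pages land on the NUMA node of the
 * thread that first writes them, so initialising the CSR arrays with the
 * same schedule(static) distribution as the later SpMV loop places each
 * thread's working set in its local memory.  All names are hypothetical. */
void init_first_touch(int nrows, int nnz, double *val, int *colidx,
                      int *rowptr, double *x, double *y)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i <= nrows; i++)
            rowptr[i] = 0;

        #pragma omp for schedule(static)
        for (int i = 0; i < nrows; i++)
            x[i] = y[i] = 0.0;

        #pragma omp for schedule(static)
        for (int k = 0; k < nnz; k++) {
            val[k] = 0.0;
            colidx[k] = 0;
        }
    }
    /* ...then fill in the actual matrix entries, ideally using the same
     * thread-to-row mapping as the SpMV itself. */
}

The key point is that the initialisation loops use the same schedule(static) distribution as the compute loop, so the thread-to-data mapping matches; this only helps if the threads are also pinned (hence OMP_PROC_BIND).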

I would also like to point out that this:

done = true;
#pragma omp parallel shared(done)
{
    bool tdone = true;
    // ...
    #pragma omp atomic update
    done &= tdone;
}

is probably a not-very-efficient re-implementation of:

done = true;
#pragma omp parallel reduction(&:done)
{
    // ...
    if (residual > tolerance) {
        done = false;
    }
    // ...
}

Given the amount of work done in the inner loop, there won't be a notable performance difference between the two implementations, but it is still not a good idea to reimplement existing OpenMP primitives: the built-in reduction is better for both portability and readability.
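
For reference, here is a minimal, self-contained illustration of the reduction(&:done) pattern; the residual values and the loop body are invented for the example:

#include <omp.h>
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    double residual[1000];
    for (int i = 0; i < 1000; i++)
        residual[i] = (i == 500) ? 1.0 : 1e-12;  /* one element not yet converged */

    const double tolerance = 1e-6;
    bool done = true;

    /* Each thread gets a private copy of done; the copies are combined
     * with & at the end of the parallel region. */
    #pragma omp parallel for reduction(&:done)
    for (int i = 0; i < 1000; i++) {
        if (residual[i] > tolerance)
            done = false;
    }

    printf("done = %d\n", done);  /* prints 0: not yet converged */
    return 0;
}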


Try running the Intel Performance Counter Monitor (IPCM). You can watch the memory bandwidth and see whether it maxes out as you add cores. My gut feeling is that you are memory-bandwidth limited.

As a quick back-of-the-envelope calculation: I find that uncached read bandwidth is about 10 GB/s on a Xeon. At a 2.5 GHz clock, that is one 32-bit word per clock cycle. Your inner loop is basically just a multiply-add whose cycles you can count on one hand, plus a few cycles of loop overhead, so it doesn't surprise me that you see no gain beyond 10 threads.
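
If you don't have IPCM at hand, a crude streaming-read probe can serve the same purpose. The following sketch (array size and names are illustrative) reports the aggregate read bandwidth for whatever OMP_NUM_THREADS you set:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = (size_t)1 << 27;   /* 128M doubles = 1 GiB, far larger than cache */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)      /* first touch also distributes the pages */
        a[i] = 1.0;

    double t0 = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];                    /* pure streaming read */
    double t1 = omp_get_wtime();

    printf("%.1f GB/s (checksum %g)\n",
           n * sizeof(double) / (t1 - t0) / 1e9, sum);
    free(a);
    return 0;
}

Run it with increasing thread counts; once the reported number plateaus, adding more threads to the actual SpMV won't help either.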


Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of those reads. OpenMP is obligated to ensure that atomic reads and writes of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't an explicit one anywhere.

It might even need to lock the whole sol array unless it can somehow work out which reads might conflict with which writes, and really, OpenMP implementations aren't necessarily all that smart.
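
For concreteness, the pattern being described looks roughly like this; it is a schematic with hypothetical names, not your actual code:

/* Schematic of the access pattern: a CSR-style sweep where the inner loop
 * atomically reads sol[] and the middle loop atomically writes back into
 * sol[].  Any read and write that may hit the same element must be
 * serialized by the runtime.  All names are hypothetical. */
void sweep(int n, const int *rowptr, const int *col,
           const double *val, double *sol)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            double x;
            #pragma omp atomic read
            x = sol[col[k]];        /* atomic read: may race with the write below */
            s += val[k] * x;
        }
        #pragma omp atomic write
        sol[i] = s;                 /* atomic write to a possibly shared location */
    }
}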

No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.