
Performance degradation of matrix multiplication of single vs double precision arrays on multi-core machine


I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours: the Python version was running at ~2.2 s, while the C version showed huge variations between 1.4 and 2.2 s.

Applying KMP_AFFINITY=scatter,granularity=thread ensures that each of the 28 threads always runs on the same processor thread.

This reduces both runtimes to a more stable ~1.24 s for C and ~1.26 s for Python.
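A minimal sketch of how that pinning could be applied from the Python side (the 28-thread count and the matrix size are just placeholders for this system, and your NumPy build must actually be linked against MKL / Intel OpenMP for KMP_AFFINITY to have any effect):

```python
import os

# The OpenMP runtime reads these at initialization, so they must be set
# before NumPy (and with it MKL/OpenMP) is imported.
os.environ.setdefault("KMP_AFFINITY", "scatter,granularity=thread")
os.environ.setdefault("OMP_NUM_THREADS", "28")   # placeholder core count

import time
import numpy as np

n = 4096                                          # placeholder problem size
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
a @ b
print(f"sgemm: {time.perf_counter() - t0:.3f} s")
```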

This is on a 28-core dual-socket Xeon E5-2680 v3 system.

Interestingly, on a very similar 24-core dual-socket Haswell system, Python and C perform almost identically even without thread affinity / pinning.

Why does Python affect the scheduling? I assume it is because there is more runtime environment around it. The bottom line is: without pinning, your performance results will be non-deterministic.

You also need to consider that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more choices for pinning, for instance KMP_AFFINITY=compact - but for some reason that is completely broken on my system. You can add ,verbose to the variable to see how the runtime is pinning your threads.

likwid-pin is a useful alternative that provides more convenient control.

In general, single precision should be at least as fast as double precision. Double precision can be slower because:

  • You need more memory/cache bandwidth for double precision.
  • ALUs can be built with higher throughput for single precision, but that usually applies to GPUs rather than CPUs.

I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.
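A rough way to check this on your machine once pinning is in place is to time the same multiplication for both dtypes (the problem size and repeat count below are arbitrary; with a BLAS-backed NumPy the float32 run should come out at or below the float64 time):

```python
import time
import numpy as np

def best_gemm_time(dtype, n=4096, repeats=5):
    # Best-of-repeats timing of an n x n matrix multiplication for the dtype,
    # to reduce the influence of scheduling noise.
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        times.append(time.perf_counter() - t0)
    return min(times)

for dtype in (np.float32, np.float64):
    print(dtype.__name__, f"{best_gemm_time(dtype):.3f} s")
```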

When you scale up the number of threads for MKL/*gemm, consider the following (a thread-count sweep sketch follows this list):

  • Memory / shared cache bandwidth may become a bottleneck, limiting scalability.
  • Turbo mode will effectively decrease the core frequency as utilization increases. This applies even when you run at nominal frequency: on Haswell-EP processors, AVX instructions impose a lower "AVX base frequency" - but the processor is allowed to exceed it when fewer cores are utilized / thermal headroom is available, and in general even more for a short time. If you want perfectly neutral results, you would have to use the AVX base frequency, which is 1.9 GHz for you. It is documented here, and explained in one picture.
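To see where the scaling flattens out, one option is to sweep the BLAS thread count. A sketch using the third-party threadpoolctl package (the thread counts are just examples for a 28-core machine):

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits   # assumes threadpoolctl is installed

n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Sweep the BLAS thread count to see where memory bandwidth (and reduced
# turbo headroom) stops the runtime from improving.
for nthreads in (1, 2, 4, 8, 14, 28):
    with threadpool_limits(limits=nthreads):
        t0 = time.perf_counter()
        a @ b
        print(f"{nthreads:2d} threads: {time.perf_counter() - t0:.3f} s")
```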

I don't think there is a really simple way to measure how your application is affected by bad scheduling. You can expose it with perf trace -e sched:sched_switch, and there is software to visualize the result, but this comes with a steep learning curve. And then again - for parallel performance analysis you should have the threads pinned anyway.