Why is the GNU scientific library matrix multiplication slower than numpy.matmul?
TL;DR: the C++ code and Numpy do not use the same matrix-multiplication library.
The matrix multiplication of the GSL library is not optimized. On my machine, it runs sequentially, does not use SIMD instructions (SSE/AVX), and does not unroll the loops to perform register tiling. I also suspect it does not use the CPU cache efficiently due to the lack of tiling. These optimizations are critical to achieving high performance and are widely used in fast linear algebra libraries.
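For reference, an unoptimized kernel looks like the classic triple loop below (a sketch of the general pattern, not the actual GSL source): one scalar multiply-add per inner iteration, no vectorization, no blocking.

```cpp
#include <vector>

// Naive O(n^3) matrix multiplication on row-major n x n matrices.
// The inner loop does a single scalar multiply-add per iteration,
// which is roughly the pattern a non-optimized BLAS compiles down to.
void naive_sgemm(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < n; ++k)
                s += a[i * n + k] * b[k * n + j];  // one scalar FLOP pair
            c[i * n + j] = s;
        }
}
```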
Numpy uses a BLAS library installed on your machine. On many Linux platforms, it uses OpenBLAS or the Intel MKL. Both are very fast (they use all the techniques described above) and should run in parallel.
You can find which implementation of BLAS is used by Numpy here. On my Linux machine, Numpy uses CBLAS by default, which internally uses OpenBLAS (OpenBLAS is, strangely, not directly detected by Numpy).
There are many fast parallel BLAS implementations (GotoBLAS, ATLAS, BLIS, etc.). The open-source BLIS library is great because its matrix multiplication is very fast on many different architectures.
As a result, the simplest way to improve your C++ code is to use the cblas_sgemm CBLAS function and link against a fast BLAS library such as OpenBLAS or BLIS.
For more information:
One simple way to see how badly the GSL performs is to use a profiler (like perf on Linux or VTune on Windows). In your case, Linux perf reports that >99% of the time is spent in libgslcblas.so (i.e. the GSL library). More specifically, most of the execution time is spent in the following assembly loop:
```
250:   movss  (%rdx),%xmm1
       add    $0x4,%rax
       add    $0x4,%rdx
       mulss  %xmm2,%xmm1        # scalar instructions
       addss  -0x4(%rax),%xmm1
       movss  %xmm1,-0x4(%rax)
       cmp    %rax,%r9
       jne    250
```
As for Numpy, 99% of its time is spent in libopenblasp-r0.3.13.so (i.e. the OpenBLAS library), more specifically in the following assembly code of the function dgemm_kernel_HASWELL:
```
110:   lea          0x80(%rsp),%rsi
       add          $0x60,%rsi
       mov          %r12,%rax
       sar          $0x3,%rax
       cmp          $0x2,%rax
       jl           d26
       prefetcht0   0x200(%rdi)        # data prefetching
       vmovups      -0x60(%rsi),%ymm1
       prefetcht0   0xa0(%rsi)
       vbroadcastsd -0x80(%rdi),%ymm0  # fast SIMD instruction (AVX)
       prefetcht0   0xe0(%rsi)
       vmovups      -0x40(%rsi),%ymm2
       prefetcht0   0x120(%rsi)
       vmovups      -0x20(%rsi),%ymm3
       vmulpd       %ymm0,%ymm1,%ymm4
       prefetcht0   0x160(%rsi)
       vmulpd       %ymm0,%ymm2,%ymm8
       vmulpd       %ymm0,%ymm3,%ymm12
       prefetcht0   0x1a0(%rsi)
       vbroadcastsd -0x78(%rdi),%ymm0
       vmulpd       %ymm0,%ymm1,%ymm5
       vmulpd       %ymm0,%ymm2,%ymm9
       [...]
```
We can clearly see that the GSL code is not optimized (scalar instructions and a naive inner loop), while the OpenBLAS code is: it uses wide SIMD instructions, data prefetching and loop unrolling. Note that the executed OpenBLAS code is still not optimal, as it could use the FMA instructions available on my processor.