
Why is the GNU scientific library matrix multiplication slower than numpy.matmul?


TL;DR: the C++ code and Numpy do not use the same matrix-multiplication library.

The matrix multiplication of the GSL library is not optimized. On my machine, it runs sequentially, does not use SIMD instructions (SSE/AVX), and does not unroll the loops to perform register tiling. I also suspect it does not use the CPU cache efficiently due to the lack of tiling. These optimizations are critical for high performance and are widely used in fast linear algebra libraries.
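For reference, an unoptimized matrix multiplication looks roughly like the following scalar triple loop (a minimal sketch, not GSL's actual source; GSL's inner loop compiles down to the same kind of scalar code, as shown by the profile further below):

    #include <cstddef>
    #include <vector>

    // Rough sketch of an unoptimized single-precision matmul C += A * B
    // (row-major, n x n). Not GSL's source, but representative of the kind
    // of scalar inner loop the profiler shows below: one scalar multiply
    // and one scalar add per iteration, no SIMD, no tiling, no threads.
    void naive_sgemm(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
            {
                const float a = A[i * n + k];          // loop-invariant scalar
                for (std::size_t j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];  // scalar multiply-add
            }
    }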

Numpy uses a BLAS library installed on your machine. On many Linux platforms, it uses OpenBLAS or the Intel MKL. Both are very fast (they use all the techniques described above) and run in parallel.

You can find out which BLAS implementation Numpy uses (for example with numpy.show_config()). On my Linux machine, Numpy uses CBLAS by default, which internally uses OpenBLAS (strangely, OpenBLAS is not directly detected by Numpy).

There are many fast parallel BLAS implementations (GotoBLAS, ATLAS, BLIS, etc.). The open-source BLIS library is great because its matrix multiplication is very fast on many different architectures.

As a result, the simplest way to improve your C++ code is to call the cblas_sgemm CBLAS function and link against a fast BLAS library like OpenBLAS or BLIS.
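A minimal sketch of what that could look like (matrix size and values are placeholders, adapt them to your actual code), compiled with something like g++ -O2 main.cpp -lopenblas:

    #include <cblas.h>   // header from OpenBLAS/BLIS/MKL, not GSL's gsl_cblas.h
    #include <vector>

    int main()
    {
        const int n = 2000;                      // placeholder size
        std::vector<float> A(n * n, 1.0f);
        std::vector<float> B(n * n, 2.0f);
        std::vector<float> C(n * n, 0.0f);

        // C = 1.0 * A * B + 0.0 * C, row-major, no transposition.
        // With OpenBLAS/BLIS linked in, this call is SIMD-vectorized,
        // cache-blocked and multi-threaded.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0f, A.data(), n,
                          B.data(), n,
                    0.0f, C.data(), n);
        return 0;
    }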


For more information:

One simple way to see how poorly GSL performs is to use a profiler (like perf on Linux or VTune on Windows). In your case, Linux perf reports that >99% of the time is spent in libgslcblas.so (i.e. the GSL library). More specifically, most of the execution time is spent in the following assembly loop:

    250:   movss   (%rdx),%xmm1
           add     $0x4,%rax
           add     $0x4,%rdx
           mulss   %xmm2,%xmm1           # scalar instructions
           addss   -0x4(%rax),%xmm1
           movss   %xmm1,-0x4(%rax)
           cmp     %rax,%r9
         ↑ jne     250

As for Numpy, 99% of its time is spent in libopenblasp-r0.3.13.so (i.e. the OpenBLAS library). More specifically, in the following assembly code of the function dgemm_kernel_HASWELL:

    110:   lea          0x80(%rsp),%rsi
           add          $0x60,%rsi
           mov          %r12,%rax
           sar          $0x3,%rax
           cmp          $0x2,%rax
         ↓ jl           d26
           prefetcht0   0x200(%rdi)          # Data prefetching
           vmovups      -0x60(%rsi),%ymm1
           prefetcht0   0xa0(%rsi)
           vbroadcastsd -0x80(%rdi),%ymm0    # Fast SIMD instruction (AVX)
           prefetcht0   0xe0(%rsi)
           vmovups      -0x40(%rsi),%ymm2
           prefetcht0   0x120(%rsi)
           vmovups      -0x20(%rsi),%ymm3
           vmulpd       %ymm0,%ymm1,%ymm4
           prefetcht0   0x160(%rsi)
           vmulpd       %ymm0,%ymm2,%ymm8
           vmulpd       %ymm0,%ymm3,%ymm12
           prefetcht0   0x1a0(%rsi)
           vbroadcastsd -0x78(%rdi),%ymm0
           vmulpd       %ymm0,%ymm1,%ymm5
           vmulpd       %ymm0,%ymm2,%ymm9
           [...]

We can clearly see that the GSL code is not optimized (scalar instructions in a naive simple loop), while the OpenBLAS code is, since it uses wide SIMD instructions (AVX), data prefetching and loop unrolling. Note that the executed OpenBLAS code is still not optimal, as it could use the FMA instructions available on my processor.