Comparing Python, Numpy, Numba and C++ for matrix multiplication Comparing Python, Numpy, Numba and C++ for matrix multiplication numpy numpy

Comparing Python, Numpy, Numba and C++ for matrix multiplication


Definitely use -O3 for optimization. This turns vectorizations on, which should significantly speed your code up.

Numba is supposed to do that already.


What I would recommend

If you want maximum efficiency, you should use a dedicated linear algebra library, the classic of which is BLAS/LAPACK libraries. There are a number of implementations, eg. Intel MKL. What you write is NOT going to outpeform hyper-optimized libraries.

Matrix matrix multiply is going to be the dgemm routine: d stands for double, ge for general, and mm for matrix matrix multiply. If your problem has additional structure, a more specific function may be called for additional speedup.

Note that Numpy dot ALREADY calls dgemm! You're probably not going to do better.

Why your c++ is slow

Your classic, intuitive algorithm for matrix-matrix multiplication turns out to be slow compared to what's possible. Writing code that takes advantage of how processors cache etc... yields important performance gains. The point is, tons of smart people have devoted their lives to making matrix matrix multiply extremely fast, and you should use their work and not reinvent the wheel.


In your current implementation most likely compiler is unable to auto vectorize the most inner loop because its size is 3. Also m2 is accessed in a "jumpy" way. Swapping loops so that iterating over p is in the most inner loop will make it work faster (col will not make "jumpy" data access) and compiler should be able to do better job (autovectorize).

for (int row = 0; row < m; row++) {    for (int k = 0; k < n; k++) {        for (int col = 0; col < p; col++) {            m3.data_[p*row + col] += m1.data_[n*row + k] * m2.data_[p*k + col];        }    }}

On my machine the original C++ implementation for p=10^6 elements build with g++ dot.cpp -std=c++11 -O3 -o dot flags takes 12ms and above implementation with swapped loops takes 7ms.