Comparing Python, Numpy, Numba and C++ for matrix multiplication
Definitely use -O3
for optimization. This turns vectorizations on, which should significantly speed your code up.
Numba is supposed to do that already.
What I would recommend
If you want maximum efficiency, you should use a dedicated linear algebra library, the classic of which is BLAS/LAPACK libraries. There are a number of implementations, eg. Intel MKL. What you write is NOT going to outpeform hyper-optimized libraries.
Matrix matrix multiply is going to be the dgemm
routine: d stands for double, ge for general, and mm for matrix matrix multiply. If your problem has additional structure, a more specific function may be called for additional speedup.
Note that Numpy dot ALREADY calls dgemm
! You're probably not going to do better.
Why your c++ is slow
Your classic, intuitive algorithm for matrix-matrix multiplication turns out to be slow compared to what's possible. Writing code that takes advantage of how processors cache etc... yields important performance gains. The point is, tons of smart people have devoted their lives to making matrix matrix multiply extremely fast, and you should use their work and not reinvent the wheel.
In your current implementation most likely compiler is unable to auto vectorize the most inner loop because its size is 3. Also m2
is accessed in a "jumpy" way. Swapping loops so that iterating over p
is in the most inner loop will make it work faster (col
will not make "jumpy" data access) and compiler should be able to do better job (autovectorize).
for (int row = 0; row < m; row++) { for (int k = 0; k < n; k++) { for (int col = 0; col < p; col++) { m3.data_[p*row + col] += m1.data_[n*row + k] * m2.data_[p*k + col]; } }}
On my machine the original C++ implementation for p=10^6 elements build with g++ dot.cpp -std=c++11 -O3 -o dot
flags takes 12ms
and above implementation with swapped loops takes 7ms
.