How to speed up matrix multiplication in C++?
Speaking of speed-up, your function will be more cache-friendly if you swap the order of the `k` and `j` loop iterations:
```cpp
matrix mult_std(matrix a, matrix b) {
    matrix c(a.dim(), false, false);
    for (int i = 0; i < a.dim(); i++)
        for (int k = 0; k < a.dim(); k++)
            for (int j = 0; j < a.dim(); j++)   // swapped order
                c(i, j) += a(i, k) * b(k, j);
    return c;
}
```
That's because a `k` index on the innermost loop will cause a cache miss in `b` on every iteration: consecutive values of `k` touch `b(k,j)` a whole row apart in memory. With `j` as the innermost index, both `c` and `b` are accessed contiguously, while `a` stays put.
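To make the two access patterns concrete, here is the same comparison sketched on plain row-major arrays (the question's `matrix` class isn't shown, so `std::vector<double>` stands in for it; the function names are made up for the illustration):

```cpp
#include <vector>

// Naive order: k innermost, so b[k*n + j] is read with stride n
// (a full row apart) on every iteration of the inner loop.
void mult_ijk(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

// Swapped order: j innermost, so both c[i*n + j] and b[k*n + j]
// walk memory sequentially, while a(i,k) stays fixed.
void mult_ikj(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double aik = a[i*n + k];            // hoisted: constant over j
            for (int j = 0; j < n; ++j)
                c[i*n + j] += aik * b[k*n + j];
        }
}
```

Both orders compute the same product; only the memory traffic differs, which is where the speed-up comes from.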
Make sure that the members `dim()` and `operator()()` are declared inline, and that compiler optimization is turned on. Then play with options like `-funroll-loops` (on gcc).
How big is `a.dim()` anyway? If a row of the matrix doesn't fit in just a couple of cache lines, you'd be better off with a block access pattern instead of a full row at a time.
You say you don't want to modify the algorithm, but what does that mean exactly?
Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX, whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?
If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.
Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.