Cython's prange not improving performance
I think the parallelization is working, but its extra overhead is eating up the time it would have saved. If I try with different sized arrays, I do begin to see a speed-up in the parallel version:
XA = np.random.random((900, 2100))
XB = np.random.random((100, 2100, 90))
Here the parallel version takes ~2/3 of the time of the serial version for me, which certainly isn't the 1/4 you'd expect, but does at least show some benefit.
One improvement I can offer is to replace the code that fixes contiguity:
XB = np.asanyarray([np.ascontiguousarray(XB[:,:,i]) for i in range(n)])
with
XB = np.ascontiguousarray(np.transpose(XB,[2,0,1]))
This speeds up both the parallel and non-parallel functions fairly significantly (a factor of 2 with the arrays you originally gave). It does make it slightly more obvious that you're being slowed down by overhead in the prange: the serial version is actually faster for the arrays in your example.
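As a quick sanity check that the two ways of fixing contiguity are interchangeable, here is a minimal sketch. It assumes that `n` in your code is `XB.shape[2]` (I can't see where it is defined), and it uses a small stand-in array rather than your real data:

```python
import numpy as np

# Small stand-in for XB; the shape is arbitrary and chosen only for illustration.
rng = np.random.default_rng(0)
XB = rng.random((10, 21, 9))

# Assumption: n is the size of the last axis of XB.
n = XB.shape[2]

# Original approach: copy each 2D slice, then stack the copies into a new array.
XB_loop = np.asanyarray([np.ascontiguousarray(XB[:, :, i]) for i in range(n)])

# Suggested approach: move the last axis to the front, then make a single copy.
XB_transposed = np.ascontiguousarray(np.transpose(XB, [2, 0, 1]))

# Both results should have shape (n, 10, 21) and hold identical values.
same_shape = XB_loop.shape == XB_transposed.shape
same_values = np.array_equal(XB_loop, XB_transposed)
is_contiguous = XB_transposed.flags["C_CONTIGUOUS"]
print(same_shape, same_values, is_contiguous)
```

The transpose version does one bulk copy instead of `n` separate copies plus a stacking step, which is presumably where the time is saved.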