
Why is Cython so much slower than Numba when iterating over NumPy arrays?


As @Antonio has pointed out, using pow where a simple multiplication would do is not very wise and leads to quite an overhead.
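The question's original code isn't reproduced here, but a minimal reconstruction of such a pow-based Cython kernel might look like this (cy_f_pow is a hypothetical name):

from libc.math cimport pow
import numpy as np

def cy_f_pow(double[:] arr):
    cdef Py_ssize_t i
    cdef Py_ssize_t n = arr.shape[0]
    res = np.zeros(n)
    cdef double[:] res_view = res
    for i in range(n):
        res_view[i] = pow(arr[i], 2)  # a full libc pow() call for every element
    return res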

Thus, replacing pow(arr[i], 2) with arr[i]*arr[i] leads to a pretty large speed-up:

cython-pow-version        356 µs
numba-version              11 µs
cython-mult-version        14 µs

The remaining difference is probably due to differences between the compilers and the levels of optimization (llvm vs MSVC in my case). You might want to use clang to match numba's performance (see for example this SO-answer).

In order to make the optimization easier for the compiler, you should declare the input as a contiguous array, i.e. double[::1] arr (see this question on why this is important for vectorization), and use @cython.boundscheck(False) (use option -a to see that there is less yellow). Also add compiler flags such as -O3 and -march=native (or similar, depending on your compiler) to enable the vectorization, and watch out for build flags that are set by default and can inhibit some optimizations, for example -fwrapv. In the end you might want to write the workhorse loop in C, compile it with the right combination of flags/compiler and wrap it with Cython.
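Putting that together, a sketch of a tuned Cython version (the same hypothetical kernel as above; the build flags are assumptions that depend on your compiler):

cimport cython
import numpy as np

# built e.g. with extra_compile_args=['-O3', '-march=native'] for gcc/clang
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_f_mult(double[::1] arr):  # [::1] promises a C-contiguous input
    cdef Py_ssize_t i
    cdef Py_ssize_t n = arr.shape[0]
    res = np.zeros(n)
    cdef double[::1] res_view = res
    for i in range(n):
        res_view[i] = arr[i] * arr[i]
    return res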

By the way, by typing the function's parameters as nb.float64[:](nb.float64[:]) you decrease the performance of numba - it is no longer allowed to assume that the input array is contiguous, which rules the vectorization out. Let numba detect the types (or declare them as contiguous, i.e. nb.float64[::1](nb.float64[::1])), and you will get better performance:

import numba as nb
import numpy as np

@nb.jit(nopython=True)
def nb_vec_f(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res

This leads to the following improvement:

%timeit f(arr)  # numba version
# 11.4 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit nb_vec_f(arr)
# 7.03 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
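If you do want an explicit signature, typing both arrays as contiguous keeps the vectorization possible; a sketch (nb_vec_f_sig is a hypothetical name):

import numba as nb
import numpy as np

# [::1] declares the arrays as C-contiguous, unlike [:]
@nb.jit(nb.float64[::1](nb.float64[::1]), nopython=True)
def nb_vec_f_sig(arr):
    res = np.zeros(len(arr))
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res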

And as pointed out by @max9111, we don't have to initialize the result array with zeros; we can use np.empty(...) instead of np.zeros(...). This version even beats numpy's np.square().
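A sketch of that variant (nb_vec_f_empty is a hypothetical name):

import numba as nb
import numpy as np

@nb.jit(nopython=True)
def nb_vec_f_empty(arr):
    res = np.empty(len(arr))  # no need to zero memory that is overwritten right away
    for i in range(len(arr)):
        res[i] = (arr[i])**2
    return res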

The performance of the different approaches on my machine is:

numba+vectorization+empty     3 µs
np.square                     4 µs
numba+vectorization           7 µs
numba missed vectorization   11 µs
cython+mult                  14 µs
cython+pow                  356 µs