Why is numba faster than numpy here?


I think this question highlights (somewhat) the limitations of calling out to precompiled functions from a higher-level language. Suppose in C++ you write something like:

for (int i = 0; i != N; ++i) a[i] = b[i] + c[i] + 2 * d[i];

The compiler sees all this at compile time, the whole expression. It can do a lot of really intelligent things here, including optimizing out temporaries (and loop unrolling).

In Python, however, consider what's happening: when you use NumPy, each '+' uses operator overloading on the NumPy array types (which are just thin wrappers around contiguous blocks of memory, i.e. arrays in the low-level sense) and calls out to a precompiled function (written in C, in NumPy's case) which does the addition super fast. But it does just one addition and spits out a temporary.
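To make that concrete, here is a rough sketch of what the C++ expression above amounts to when written with NumPy. The array names follow the C++ snippet; the values and the names t1 and t2 are made up for illustration.

```python
import numpy as np

# Illustrative inputs; the names mirror the C++ loop above.
b = np.arange(5, dtype=np.float64)
c = np.ones(5)
d = np.full(5, 3.0)

# Roughly what `b + c + 2 * d` amounts to: three separate calls into
# compiled code, each returning a freshly allocated full-size array.
t1 = np.add(b, c)        # b + c      -> temporary
t2 = np.multiply(2, d)   # 2 * d      -> temporary
a = np.add(t1, t2)       # t1 + t2    -> final result

# Same values as the one-line expression.
same = np.array_equal(a, b + c + 2 * d)
```

Each call is fast on its own, but none of them knows about the others, so nothing can merge the three passes over memory into one.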

So while NumPy is awesome, convenient and pretty fast, it is slowing things down here: it looks like it hands the hard work to a fast compiled language, but the compiler never gets to see the whole program, it is just fed isolated little bits. This is hugely detrimental to optimization. Modern compilers are very intelligent and, when they can see well-written code as a whole, can generate machine code that lets the CPU retire multiple instructions per cycle.

Numba, on the other hand, uses a JIT compiler. So at runtime it can figure out that the temporaries are not needed and optimize them away. Basically, Numba gets a chance to compile the program as a whole, whereas NumPy can only call small atomic blocks which have themselves been precompiled.
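As a minimal sketch of the Numba side, the same computation can be written as one explicit loop and handed to the JIT. The function name fused and the sample arrays are hypothetical, and the try/except fallback is only there so the snippet still runs (slowly, as plain Python) if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for this sketch): a no-op decorator, so the
    # code still runs without Numba, just without any speedup.
    def njit(func):
        return func

@njit
def fused(b, c, d):
    # One pass over the data, no intermediate arrays: the whole
    # expression is visible to the compiler at once.
    a = np.empty_like(b)
    for i in range(b.shape[0]):
        a[i] = b[i] + c[i] + 2 * d[i]
    return a

b = np.arange(5, dtype=np.float64)
c = np.ones(5)
d = np.full(5, 3.0)
result = fused(b, c, d)
```

The first call includes compilation time, so any timing comparison should call the function once before measuring.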


When you ask numpy to do:

x = x*2 - ( y * 55 )

It is internally translated to something like:

tmp1 = y * 55
tmp2 = x * 2
tmp3 = tmp2 - tmp1
x = tmp3

Each of those temporaries is an array that has to be allocated, operated on, and then deallocated. Numba, on the other hand, handles things one item at a time and doesn't have to deal with that overhead.
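If you want to stay in plain NumPy, one way to cut down on those temporaries is the out= argument that ufuncs accept. This is only a sketch with made-up values and a hypothetical scratch buffer, and it overwrites x in place, which may or may not be acceptable in your code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.5])
expected = x * 2 - (y * 55)      # the usual way, with temporaries

# Same computation with a single reusable scratch buffer.
scratch = np.empty_like(y)
np.multiply(y, 55, out=scratch)  # scratch = y * 55
np.multiply(x, 2, out=x)         # x = x * 2, written in place
np.subtract(x, scratch, out=x)   # x = x - scratch, written in place
```

This still makes three passes over memory, so it removes the allocations but not the extra memory traffic that a single fused loop avoids.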


Numba is often faster than NumPy and even Cython (at least on Linux), though how much depends on the algorithm and the platform.

Here's a plot (taken from "Numba vs. Cython: Take 2"):

[Figure: Benchmark on Numpy, Cython and Numba]

In this benchmark, pairwise distances have been computed, so this may depend on the algorithm.
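For reference, here is a rough sketch of what such a pairwise-distance computation can look like. I'm assuming Euclidean distances between the rows of a 2D array; the function names and the sample data are mine, not taken from the benchmark.

```python
import numpy as np

def pairwise_numpy(X):
    # Broadcasting builds an (M, M, N) temporary before reducing it.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pairwise_loops(X):
    # Explicit loops: slow in plain Python, but this is the shape of
    # code that a JIT or Cython can compile into a tight loop.
    M, N = X.shape
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            acc = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                acc += tmp * tmp
            D[i, j] = np.sqrt(acc)
    return D

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D_numpy = pairwise_numpy(X)
D_loops = pairwise_loops(X)
```

The broadcasting version is short but allocates a large intermediate array, while the loop version touches each element once, which is why the choice of algorithm matters for this kind of comparison.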

Note that this may be different on other platforms. See this for WinPython (from the WinPython Cython tutorial):

[Figure: Benchmark on Numpy, Cython and Numba with Winpython]