
Why is this numba code 6x slower than numpy code?


It is quite weird that numba can be so much slower.

It's not too weird. When you call NumPy functions inside a numba function, you call the numba version of these functions. These can be faster, slower, or just as fast as the NumPy versions. You might be lucky or unlucky (you were unlucky!). But even in the numba function you still create lots of temporaries, because you use the NumPy functions (one temporary array for the dot result, one for each square and sum, one for the dot plus the first sum), so you don't take advantage of the possibilities numba offers.
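To make those temporaries concrete, here is a sketch of what such a NumPy-only function typically looks like (the original code isn't shown above, so this reconstruction is an assumption); each marked call materializes a full intermediate array:

import numpy as np

def euclidean_distance_square(x1, x2):
    # Hypothetical reconstruction of the original NumPy-only version.
    dot = np.dot(x1, x2.T)               # temporary 1: the dot result
    sq1 = np.sum(np.square(x1), axis=1)  # temporaries 2 and 3: square, then sum
    sq2 = np.sum(np.square(x2), axis=1)  # temporaries 4 and 5: square, then sum
    # the multiplication and each addition create further temporaries
    return -2 * dot + np.expand_dims(sq1, axis=1) + sq2

Numba can avoid these allocations by fusing the work into explicit loops, but only if you actually write the loops instead of calling the NumPy functions.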

Am I using it wrong?

Essentially: Yes.

I really need to speed this up

Okay, I'll give it a try.

Let's start by unrolling the sum-of-squares-along-axis-1 calls:

import numpy as np
import numba as nb

@nb.njit
def sum_squares_2d_array_along_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res

@nb.njit
def euclidean_distance_square_numba_v1(x1, x2):
    return (-2 * np.dot(x1, x2.T)
            + np.expand_dims(sum_squares_2d_array_along_axis1(x1), axis=1)
            + sum_squares_2d_array_along_axis1(x2))

On my computer that's already 2 times faster than the NumPy code and almost 10 times faster than your original Numba code.

Speaking from experience, getting it 2x faster than NumPy is generally the limit (at least if the NumPy version isn't needlessly complicated or inefficient). However, you can squeeze out a bit more by unrolling everything:

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v2(x1, x2):
    f1 = 0.
    for i_idx in range(x1.shape[1]):
        f1 += x1[0, i_idx] * x1[0, i_idx]
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            val_from_x2 = x2[o_idx, i_idx]
            val += (-2) * x1[0, i_idx] * val_from_x2 + val_from_x2 * val_from_x2
        val += f1
        res[o_idx] = val
    return res

But that only gives a ~10-20% improvement over the previous approach.

At that point you might realize that you can simplify the code (even though it probably won't speed it up):

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Yeah, that looks pretty straightforward and it's not really slower.

However, in all the excitement I forgot to mention the obvious solution: scipy.spatial.distance.cdist, which has a sqeuclidean (squared euclidean distance) option:

from scipy.spatial import distance
distance.cdist(x1, x2, metric='sqeuclidean')

It's not really faster than numba but it's available without having to write your own function...

Tests

Test for correctness and do the warmups:

x1 = np.array([[1., 2, 3]])
x2 = np.array([[1., 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7]])

res1 = euclidean_distance_square(x1, x2)
res2 = euclidean_distance_square_numba_original(x1, x2)
res3 = euclidean_distance_square_numba_v1(x1, x2)
res4 = euclidean_distance_square_numba_v2(x1, x2)
res5 = euclidean_distance_square_numba_v3(x1, x2)

np.testing.assert_array_equal(res1, res2)
np.testing.assert_array_equal(res1, res3)
np.testing.assert_array_equal(res1[0], res4)
np.testing.assert_array_equal(res1[0], res5)
np.testing.assert_almost_equal(res1, distance.cdist(x1, x2, metric='sqeuclidean'))

Timings:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

%timeit euclidean_distance_square(x1, x2)
# 2.09 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_original(x1, x2)
# 10.9 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v1(x1, x2)
# 907 ms ± 7.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v2(x1, x2)
# 715 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v3(x1, x2)
# 731 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit distance.cdist(x1, x2, metric='sqeuclidean')
# 706 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: If you have arrays of integers you might want to change the hard-coded 0.0 in the numba functions to 0.


Despite the fact that the answer of @MSeifert makes this answer quite obsolete, I'm still posting it, because it explains in more detail why the numba version was slower than the numpy version.

As we will see, the main culprit is the different memory access patterns of numpy and numba.

We can reproduce the behavior with a much simpler function:

import numpy as np
import numba as nb

def just_sum(x2):
    return np.sum(x2, axis=1)

@nb.jit('double[:](double[:, :])', nopython=True)
def nb_just_sum(x2):
    return np.sum(x2, axis=1)

x = np.random.random((2048, 2048))

And now the timings:

>>> %timeit just_sum(x)
2.33 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x)
33.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That means numpy is about 15 times faster!

When compiling the numba code with annotations (e.g. numba --annotate-html sum.html numba_sum.py) we can see how the sum is performed by numba (see the whole listing of the summation in the appendix; a simplified code sketch follows the list below):

  1. initialize the result-column
  2. add the whole first column to the result-column
  3. add the whole second column to the result-column
  4. and so on
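In plain NumPy terms, these steps correspond to something like the following (a simplified sketch of the listing in the appendix, not numba's actual generated code):

result = np.zeros(x.shape[0])
for j in range(x.shape[1]):  # one pass per column
    result += x[:, j]        # each pass reads a strided column of x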

What is the problem with this approach? The memory layout! The array is stored in row-major order, and thus reading it column-wise leads to many more cache misses than reading it row-wise (which is what numpy does). There is a great article which explains the possible cache effects.
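To see this effect in isolation, here is a minimal sketch (my own illustration, not from the original answer) of the two access patterns written as explicit numba loops; on a row-major array the first version walks through contiguous memory, while the second mirrors the column-wise pattern sketched above:

import numpy as np
import numba as nb

@nb.njit
def sum_rowwise(arr):
    # Inner loop walks along contiguous memory (cache-friendly for C order).
    res = np.zeros(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            res[i] += arr[i, j]
    return res

@nb.njit
def sum_columnwise(arr):
    # Inner loop jumps a whole row ahead at every step (cache-unfriendly).
    res = np.zeros(arr.shape[0], dtype=arr.dtype)
    for j in range(arr.shape[1]):
        for i in range(arr.shape[0]):
            res[i] += arr[i, j]
    return res

Timing these two on a large C-ordered array should show the same kind of gap as the numpy vs. numba comparison above.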

As we can see, the sum implementation of numba is not very mature yet. However, from the above consideration the numba implementation could be competitive for column-major order (i.e. a transposed matrix):

>>> %timeit just_sum(x.T)
3.09 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x.T)
3.58 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and it really is.

As the code of @MSeifert has shown, the main advantage of numba is that with its help we can reduce the number of temporary numpy arrays. However, some things that look easy are not easy at all, and a naive solution can be pretty bad. Building a sum is such an operation - one should not think that a simple loop is good enough - see for example this question.
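As a quick illustration of why a simple loop is not good enough (my own example, not from the linked question): np.sum uses pairwise summation, which keeps the rounding error small, whereas a naive left-to-right accumulation in float32 drifts noticeably:

import numpy as np

x = np.full(10**6, 0.1, dtype=np.float32)

s = np.float32(0.0)
for v in x:  # naive sequential accumulation
    s += v

print(s)          # drifts noticeably from the exact value 100000.0
print(np.sum(x))  # pairwise summation stays much closer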


Appendix: listing of the numba summation:

Function name: array_sum_impl_axis
in file: /home/ed/anaconda3/lib/python3.6/site-packages/numba/targets/arraymath.py
with signature: (array(float64, 2d, A), int64) -> array(float64, 1d, C)

def array_sum_impl_axis(arr, axis):
    ndim = arr.ndim

    if not is_axis_const:
        # Catch where axis is negative or greater than 3.
        if axis < 0 or axis > 3:
            raise ValueError("Numba does not support sum with axis"
                             "parameter outside the range 0 to 3.")

    # Catch the case where the user misspecifies the axis to be
    # more than the number of the array's dimensions.
    if axis >= ndim:
        raise ValueError("axis is out of bounds for array")

    # Convert the shape of the input array to a list.
    ashape = list(arr.shape)
    # Get the length of the axis dimension.
    axis_len = ashape[axis]
    # Remove the axis dimension from the list of dimensional lengths.
    ashape.pop(axis)
    # Convert this shape list back to a tuple using above intrinsic.
    ashape_without_axis = _create_tuple_result_shape(ashape, arr.shape)
    # Tuple needed here to create output array with correct size.
    result = np.full(ashape_without_axis, zero, type(zero))

    # Iterate through the axis dimension.
    for axis_index in range(axis_len):
        if is_axis_const:
            # constant specialized version works for any valid axis value
            index_tuple_generic = _gen_index_tuple(arr.shape, axis_index,
                                                   const_axis_val)
            result += arr[index_tuple_generic]
        else:
            # Generate a tuple used to index the input array.
            # The tuple is ":" in all dimensions except the axis
            # dimension where it is "axis_index".
            if axis == 0:
                index_tuple1 = _gen_index_tuple(arr.shape, axis_index, 0)
                result += arr[index_tuple1]
            elif axis == 1:
                index_tuple2 = _gen_index_tuple(arr.shape, axis_index, 1)
                result += arr[index_tuple2]
            elif axis == 2:
                index_tuple3 = _gen_index_tuple(arr.shape, axis_index, 2)
                result += arr[index_tuple3]
            elif axis == 3:
                index_tuple4 = _gen_index_tuple(arr.shape, axis_index, 3)
                result += arr[index_tuple4]

    return result


This is a comment on @MSeifert's answer. There are some more things to gain performance. As in every numerical code, it is advisable to think about which datatype is precise enough for your problem. Often float32 is also enough; sometimes even float64 isn't enough.

I also want to mention the fastmath keyword here, which can give another 1.7x speedup.

[Edit]

For a simple summation I looked into the LLVM code and found that on vectorization the summation was split up into partial sums (4 partial sums for double and 8 for float using AVX2). This has to be investigated further.
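The reason fastmath enables this (my reading of the LLVM behavior, not spelled out above): IEEE-754 addition is not associative, so without fastmath the compiler must keep a reduction strictly sequential; fastmath=True allows it to reassociate the sum into independent partial accumulators that map onto SIMD lanes. A minimal sketch:

import numpy as np
import numba as nb

@nb.njit
def sum_strict(x):
    # Strict IEEE semantics: LLVM may not reorder these additions.
    s = 0.
    for i in range(x.shape[0]):
        s += x[i]
    return s

@nb.njit(fastmath=True)
def sum_fast(x):
    # Identical source; fastmath permits the vectorized partial-sum reduction.
    s = 0.
    for i in range(x.shape[0]):
        s += x[i]
    return s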

Code

import numpy as np
import numba as nb
import llvmlite.binding as llvm

llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True)
def euclidean_distance_square_numba_v4(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True, parallel=True)
def euclidean_distance_square_numba_v5(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in nb.prange(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Timings

float64:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

0.42 s  v3 (@MSeifert)
0.25 s  v4
0.18 s  v5 (parallel version)
0.48 s  distance.cdist

float32:

x1 = np.random.random((1, 512)).astype(np.float32)
x2 = np.random.random((1000000, 512)).astype(np.float32)

0.09 s  v5

How to explicitly declare types

In general I wouldn't recommend this. Your input arrays can be C-contiguous (like the test data), Fortran-contiguous, or strided. If you know that your data is always C-contiguous, you can write:

@nb.njit('double[:](double[:, ::1], double[:, ::1])', fastmath=True)
def euclidean_distance_square_numba_v6(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

This offers the same performance as the v4 version, but it will fail if the input arrays are not C-contiguous or not of dtype=np.float64.
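For example (my own sketch), passing a strided view to the explicitly typed version raises a TypeError, because double[:, ::1] only matches C-contiguous float64 arrays:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

euclidean_distance_square_numba_v6(x1, x2)          # works
euclidean_distance_square_numba_v6(x1, x2[:, ::2])  # TypeError: no matching definition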

You can also use

@nb.njit('double[:](double[:, :], double[:, :])', fastmath=True)
def euclidean_distance_square_numba_v7(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

This will also work on strided arrays, but will be much slower than the version above on C-contiguous arrays (0.66 s vs. 0.25 s). Please note also that your problem is quite limited by memory bandwidth; the difference can be higher with CPU-bound calculations.

If you let Numba do the job for you, it will automatically detect whether the array is contiguous or not (providing contiguous input data on the first try and then non-contiguous data will lead to a recompilation).
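The recompilation can be observed through the dispatcher's signatures attribute (a sketch using a hypothetical untyped helper):

import numpy as np
import numba as nb

@nb.njit
def rowsum(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        s = 0.
        for j in range(arr.shape[1]):
            s += arr[i, j]
        res[i] = s
    return res

a = np.random.random((100, 100))
rowsum(a)                 # first call: compiled for C-contiguous float64
print(rowsum.signatures)  # one specialization

rowsum(a[:, ::2])         # strided view: triggers a second compilation
print(rowsum.signatures)  # now two specializations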