Numpy optimization with Numba


Unlike list.append, you should never call numpy.append in a loop! Even when appending a single element, the whole array has to be copied. Since you are only interested in the unique obj, you can instead use a Boolean array to flag the matches found so far.
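
To make the difference concrete, here is a minimal sketch of the two approaches in plain NumPy. The function names and the `pairs` input (a list of `(obj_index, ps_index)` matches) are made up for illustration and are not part of the question's code.

```python
import numpy as np

def unique_matches_append(pairs):
    """Anti-pattern: np.append allocates and copies a new array on every call."""
    matched = np.empty(0, dtype=np.int64)
    for i, _ in pairs:
        if i not in matched:
            matched = np.append(matched, i)  # full copy of `matched` each time
    return np.sort(matched)

def unique_matches_mask(pairs, nobj):
    """Flag each matched obj once in a preallocated Boolean array."""
    found = np.zeros(nobj, dtype=np.bool_)
    for i, _ in pairs:
        found[i] = True  # O(1), and duplicates are absorbed for free
    return np.flatnonzero(found)

# Hypothetical (obj_index, ps_index) matches; obj 1 and 3 appear twice.
pairs = [(3, 0), (1, 2), (3, 5), (0, 1), (1, 7)]
print(unique_matches_append(pairs))
print(unique_matches_mask(pairs, nobj=5))
```

Both functions return the same sorted unique indices, but the mask version never reallocates and needs no membership test.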

As for Numba, it works best if you write out all the loops. So for example:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def numba2(vec_obj, vec_ps, cos_maxsep):
        nps = vec_ps.shape[0]
        nobj = vec_obj.shape[0]
        dim = vec_obj.shape[1]
        # One flag per obj: True once any ps lies within the threshold
        found = np.zeros(nobj, np.bool_)
        for i in range(nobj):
            for j in range(nps):
                # Explicit dot product of obj i and ps j
                cos = 0.0
                for k in range(dim):
                    cos += vec_obj[i,k] * vec_ps[j,k]
                if cos > cos_maxsep:
                    found[i] = True
                    break
        return found.nonzero()

The added benefit is that we can break out of the loop over the ps array as soon as we find a match to the current obj.
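
If you want to check the loop versions against something simple, a vectorized NumPy reference can serve as a sketch for that. It is an assumption of mine rather than part of the original code: `matches_reference`, `random_unit_vectors`, the array sizes and the 5 degree threshold are all made up for illustration. It builds the full matrix of dot products, so it cannot exit early and needs memory for nobj x nps values (roughly 320 MB in float64 for 20,000 obj and 2,000 ps).

```python
import numpy as np

def matches_reference(vec_obj, vec_ps, cos_maxsep):
    """Indices of obj with at least one ps whose dot product exceeds cos_maxsep."""
    cos = vec_obj @ vec_ps.T  # shape (nobj, nps), all pairwise dot products
    return np.flatnonzero((cos > cos_maxsep).any(axis=1))

def random_unit_vectors(n, rng):
    """Hypothetical test data: n random unit vectors in 3D."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

rng = np.random.default_rng(0)
vec_obj = random_unit_vectors(200, rng)
vec_ps = random_unit_vectors(20, rng)
cos_maxsep = np.cos(np.radians(5.0))  # assumed threshold of 5 degrees

idx = matches_reference(vec_obj, vec_ps, cos_maxsep)
print(idx.shape)
```

Note that `found.nonzero()` returns a tuple of arrays, so the comparison would be against `numba2(vec_obj, vec_ps, cos_maxsep)[0]`.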

You can gain some more speed by specializing the function for three-dimensional space. Also, for some reason, passing all arrays and relevant dimensions into a helper function results in another speedup:

    def numba3(vec_obj, vec_ps, cos_maxsep):
        nps = len(vec_ps)
        nobj = len(vec_obj)
        out = np.zeros(nobj, bool)
        numba3_helper(vec_obj, vec_ps, cos_maxsep, out, nps, nobj)
        return np.flatnonzero(out)

    @jit(nopython=True)
    def numba3_helper(vec_obj, vec_ps, cos_maxsep, out, nps, nobj):
        for i in range(nobj):
            for j in range(nps):
                cos = (vec_obj[i,0]*vec_ps[j,0] +
                       vec_obj[i,1]*vec_ps[j,1] +
                       vec_obj[i,2]*vec_ps[j,2])
                if cos > cos_maxsep:
                    out[i] = True
                    break
        return out

Timings I get for 20,000 obj and 2,000 ps:

    %timeit angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep)
    1 loop, best of 3: 2.99 s per loop

    %timeit numba2(vec_obj, vec_ps, cos_maxsep)
    1 loop, best of 3: 444 ms per loop

    %timeit numba3(vec_obj, vec_ps, cos_maxsep)
    10 loops, best of 3: 134 ms per loop