
numba - guvectorize barely faster than jit


That's because np.sum is too simple. Summing an array is limited not only by the CPU but also by memory access time, so throwing more cores at it doesn't make much of a difference (how much depends on how fast memory access is relative to your CPU).

Just for visualization, np.sum is roughly this (ignoring any parameter other than the data):

```python
def sum(data):
    sum_ = 0.
    data = data.ravel()
    for i in range(data.size):
        item = data[i]   # memory access (I/O bound)
        sum_ += item     # addition      (CPU bound)
    return sum_
```

So if most of the time is spent accessing memory you won't see any real speedup if you parallelize it. However, if the CPU-bound work is the bottleneck, then using more cores will speed up your code significantly.

For example, if you include some operations slower than addition, you'll see a bigger improvement:

```python
from math import sqrt
from numba import njit, jit, guvectorize
import timeit
import numpy as np

@njit
def square_sum(arr):
    a = 0.
    for i in range(arr.size):
        a = sqrt(a**2 + arr[i]**2)  # sqrt and square are cpu-intensive!
    return a

@guvectorize(["void(float64[:], float64[:])"], "(n) -> ()",
             target="parallel", nopython=True)
def row_sum_gu(input, output):
    output[0] = square_sum(input)

@jit(nopython=True)
def row_sum_jit(input_array, output_array):
    m, n = input_array.shape
    for i in range(m):
        output_array[i] = square_sum(input_array[i, :])
    return output_array
```

I used IPython's %timeit here, but it should be equivalent:

```python
rows = int(64)
columns = int(1e6)

input_array = np.random.random((rows, columns))
output_array = np.zeros((rows))
output_array2 = np.zeros((rows))

# Warm up and check that they are equal
np.testing.assert_equal(row_sum_jit(input_array, output_array),
                        row_sum_gu(input_array, output_array2))

%timeit row_sum_jit(input_array, output_array.copy())  # 10 loops, best of 3: 130 ms per loop
%timeit row_sum_gu(input_array, output_array.copy())   # 10 loops, best of 3: 35.7 ms per loop
```

I'm only using 4 cores so that's pretty close to the limit of possible speedup!

Just remember that parallel computation can only significantly speed up your calculation if the job is CPU-bound.