Vectorize a 6 for loop cumulative sum in python numpy

EDIT 3:

Final (I think) version, a bit cleaner and faster, incorporating ideas from max9111's answer.

import numpy as np
import numba as nb

@nb.njit()
def func1_jit(a, b, c, d):
    # Precompute the powers of two and the factorials outside the loops
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    fact_e = np.empty((a + b - 2))
    fact_e[0] = 1
    for ei in range(1, len(fact_e)):
        fact_e[ei] = ei * fact_e[ei - 1]
    # Loops
    B = 0
    for ai in range(0, a):
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
    return B

This is already quite a bit faster than any of the previous options, but we are still not taking advantage of multiple CPUs. One way to do that is within the function itself, for example by parallelizing the outer loop. This adds some overhead to each call to create the threads, so for small inputs it is actually a bit slower, but it should be significantly faster for bigger values:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def func1_par(a, b, c, d):
    # Precompute the powers of two and the factorials outside the loops
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    fact_e = np.empty((a + b - 2))
    fact_e[0] = 1
    for ei in range(1, len(fact_e)):
        fact_e[ei] = ei * fact_e[ei - 1]
    # The outer loop runs in parallel; each thread accumulates its own partial sum
    B = np.empty((a,))
    for ai in nb.prange(0, a):
        Bi = 0
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            Bi += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
        B[ai] = Bi
    return np.sum(B)

Or, if you have many points where you want to evaluate the function, you can parallelize at that level too. Here a_arr, b_arr, c_arr and d_arr are vectors of values where the function is to be evaluated:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def func1_arr(a_arr, b_arr, c_arr, d_arr):
    # Evaluate func1_jit at many points in parallel
    B_arr = np.empty((len(a_arr),))
    for i in nb.prange(len(B_arr)):
        B_arr[i] = func1_jit(a_arr[i], b_arr[i], c_arr[i], d_arr[i])
    return B_arr

The best configuration depends on your inputs, usage pattern, hardware, etc. so you can combine the different ideas to suit your case.
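For instance, a tiny dispatcher could pick the parallel kernel only when the outer loop is long enough to amortize the threading overhead. This is just a sketch: func1_auto and the threshold of 32 are hypothetical and should be tuned by benchmarking on your own hardware.

def func1_auto(a, b, c, d, par_threshold=32):
    # The threshold is an assumption; measure on your own machine
    if a >= par_threshold:
        return func1_par(a, b, c, d)  # long outer loop: threads pay off
    return func1_jit(a, b, c, d)      # small input: single-threaded is faster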


EDIT 2:

Actually, forget what I said before. The best thing is to JIT-compile the algorithm, but in a more effective manner. Compute the expensive parts first (I took the exponential and the factorial) and then pass them to the compiled loopy function:

import numpy as np
import scipy.special
from numba import njit

def func1(a, b, c, d):
    # Precompute the expensive parts with NumPy/SciPy
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    ee = np.arange(a + b - 2)
    fact_e = scipy.special.factorial(ee)
    return func1_inner(a, b, c, d, exp_min, exp, fact_e)

@njit()
def func1_inner(a, b, c, d, exp_min, exp, fact_e):
    B = 0
    for ai in range(0, a):
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
    return B

This is, in my experiments, the fastest option by far, and it takes little extra memory (only the precomputed values, whose size is linear in the input).

a, b, c, d = 4, 6, 3, 4

# The original function
%timeit func1_orig(a, b, c, d)
# 2.07 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# The grid-evaluated function
%timeit func1_grid(a, b, c, d)
# 256 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# The precomputation + JIT-compiled function
%timeit func1_jit(a, b, c, d)
# 19.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Well, there is always the option of grid-evaluating the whole thing:

import numpy as np
import scipy.special

def func1(a, b, c, d):
    ai, bi, ci, di, ei, fi = np.ogrid[:a, :b, :c, :d, :a + b - 2, :c + d - 2]
    # Compute
    B = (2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei)
    # Mask out-of-range elements for the last two inner loops
    m = (ei < ai + bi) & (fi < ci + di)
    return np.sum(B * m)

print(func1(4, 6, 3, 4))
# 21769947.844726562

I used scipy.special.factorial because math.factorial does not work with arrays (and NumPy has no vectorized factorial of its own).
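A quick check of that difference (my own illustration, not from the original answer):

import numpy as np
import scipy.special

e = np.arange(5)
print(scipy.special.factorial(e))  # [ 1.  1.  2.  6. 24.] - works elementwise
# math.factorial(e) raises a TypeError, since it only accepts a single
# integer, which is why the vectorized version needs scipy.special.factorial.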

Obviously, the memory cost of this will grow very fast as you increase the parameters. The code actually performs more computations than necessary, because the two inner loops have varying numbers of iterations, so (with this method) you have to use the largest sizes and then mask out what you don't need. The hope is that vectorization will make up for that. A small IPython benchmark:

a, b, c, d = 4, 6, 3, 4

# func1_orig is the original loop-based version
%timeit func1_orig(a, b, c, d)
# 2.9 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# func1 here is the vectorized version
%timeit func1(a, b, c, d)
# 210 µs ± 6.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
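To put the memory cost in perspective, here is a back-of-the-envelope estimate of mine (not a measurement from the answer): the broadcast array B holds one float64 per point of the full six-dimensional grid.

# Estimated size of the intermediate array B for the 10x-scaled input
# (actual peak usage is higher because of temporaries)
a, b, c, d = 40, 60, 30, 40
n = a * b * c * d * (a + b - 2) * (c + d - 2)  # number of grid points
print(n * 8 / 2**30)  # ~143 GiB of float64s, which is why that input is infeasible here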

EDIT:

Note the previous approach is not an all-or-nothing thing either. You can choose to grid-evaluate only some of the loops. For example, the two innermost loops could be vectorized like this:

import numpy as np
import scipy.special

def func1(a, b, c, d):
    B = 0
    # Pre-create the largest index arrays and slice them per iteration
    e = np.arange(a + b - 2).reshape((-1, 1))
    f = np.arange(c + d - 2)
    for ai in range(0, a):
        for bi in range(0, b):
            ei = e[:ai + bi]
            for ci in range(0, c):
                for di in range(0, d):
                    fi = f[:ci + di]
                    B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
    return B

This still has loops, but it avoids the extra computations, and the memory requirements are much lower. Which one is best depends on the sizes of the inputs, I guess. In my tests, with the original values (4, 6, 3, 4) this is even slower than the original function; also, for that case it seems that creating new arrays for ei and fi on each iteration is faster than operating on a slice of a pre-created one (see the sketch after this paragraph).

However, if you multiply the input by 4 (16, 24, 12, 16), then this is much faster than the original (about 5x), although still slower than the fully vectorized one (about 3x slower). On the other hand, I could compute the value for the input scaled by ten (40, 60, 30, 40) with this one (in ~5 min) but not with the previous one, because of the memory (I didn't test how long it would take with the original function). Using @numba.jit helps a bit, although not enormously (it cannot use nopython mode because of the factorial function). You can experiment with vectorizing more or fewer loops depending on the size of your inputs.
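For reference, this is the variant that creates ei and fi on each iteration instead of slicing pre-built arrays (a sketch; func1_alloc is my own name for it, and which variant wins depends on the input sizes):

import numpy as np
import scipy.special

def func1_alloc(a, b, c, d):
    B = 0
    for ai in range(0, a):
        for bi in range(0, b):
            # Fresh arrays per iteration instead of slices of pre-built ones
            ei = np.arange(ai + bi).reshape((-1, 1))
            for ci in range(0, c):
                for di in range(0, d):
                    fi = np.arange(ci + di)
                    B += np.sum((2.) ** (ei - fi - ai - ci - di + 1) * (ei ** 2 - 2 * (ei * fi) - 7 * di) * scipy.special.factorial(ei))
    return B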


This is only a comment on @jdehesa's answer.

If a function isn't supported by Numba itself, it is usually advisable to implement it yourself. In the case of the factorial, this isn't a complicated task.

Code

import numpy as np
import numba as nb

@nb.njit()
def factorial(a):
    # Simple iterative factorial, usable inside nopython-compiled code
    res = 1.
    for i in range(1, a + 1):
        res *= i
    return res

@nb.njit()
def func1(a, b, c, d):
    B = 0.
    # Precompute the powers of two and the factorials
    exp_min = 5 - (a + b + c + d)
    exp_max = b
    exp = 2. ** np.arange(exp_min, exp_max + 1)
    fact_e = np.empty(a + b - 2)
    for i in range(a + b - 2):
        fact_e[i] = factorial(i)
    for ai in range(0, a):
        for bi in range(0, b):
            for ci in range(0, c):
                for di in range(0, d):
                    for ei in range(0, ai + bi):
                        for fi in range(0, ci + di):
                            B += exp[ei - fi - ai - ci - di + 1 - exp_min] * (ei * ei - 2 * (ei * fi) - 7 * di) * fact_e[ei]
    return B

Parallel version

@nb.njit(parallel=True)
def func_p(a_vec, b_vec, c_vec, d_vec):
    res = np.empty(a_vec.shape[0])
    for i in nb.prange(a_vec.shape[0]):
        res[i] = func1(a_vec[i], b_vec[i], c_vec[i], d_vec[i])
    return res

Example

a_vec = np.random.randint(low=2, high=10, size=1000000)
b_vec = np.random.randint(low=2, high=10, size=1000000)
c_vec = np.random.randint(low=2, high=10, size=1000000)
d_vec = np.random.randint(low=2, high=10, size=1000000)

res_2 = func_p(a_vec, b_vec, c_vec, d_vec)

The single-threaded version runs in 5.6 µs (after the first, compiling run) on your example.

When calculating many values, the parallel version gives close to another number-of-cores factor of speedup. Keep in mind that the compilation overhead is larger for the parallel version (above 0.5 s for the first call).
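One common way to keep that overhead out of measurements and latency-sensitive paths (a usage sketch, not part of the original answer) is a warm-up call on a tiny input:

# Warm-up: pay the ~0.5 s parallel-compilation cost once, up front
tiny = np.full(2, 2, dtype=np.int64)
func_p(tiny, tiny, tiny, tiny)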


Using this cartesian_product function, you can transform your nested loops into matrices and then simply calculate the respective nested sigmas in a vectorized manner:
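Since the cartesian_product function itself is linked rather than reproduced here, a minimal stand-in with the same interface might look like the following (an assumption on my part; the linked version is presumably a faster implementation):

import numpy as np

def cartesian_product(*arrays):
    # Rows are all combinations of the 1-D inputs, one column per input
    grids = np.meshgrid(*arrays, indexing='ij')
    return np.stack([g.ravel() for g in grids], axis=-1)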

import numpy as np
import scipy.special

def nested_sig(arrays):
    base_prod = cartesian_product(*arrays)
    second_prod = cartesian_product(base_prod[:, :2].sum(1), base_prod[:, 2:].sum(1))
    total = np.column_stack((base_prod, second_prod))
    # The items in each row denote the following variables, in order:
    # ai, bi, ci, di, ei, fi
    x = total[:, 4] - total[:, 5] - total[:, 0] - total[:, 2] - total[:, 3] + 1
    y = total[:, 4] - total[:, 5]
    # Use a float base so negative exponents work, and scipy.special.factorial
    # because math.factorial does not accept arrays
    result = np.power(2., x) * (np.power(total[:, 4], 2) - 2 * y - 7 * total[:, 3]) * scipy.special.factorial(total[:, 4])
    return result