Numpy sum running length of non-zero values
This post lists a vectorized approach which basically consists of two steps:
Initialize a zeros vector of the same size as input vector, x and set ones at places corresponding to non-zeros of
x
.Next up, in that vector, we need to put minus of runlengths of each island right after the ending/stop positions for each "island". The intention is to use cumsum again later on, which would result in sequential numbers for the "islands" and zeros elsewhere.
Here's the implementation -
import numpy as np#Append zeros at the start and end of input array, xxa = np.hstack([[0],x,[0]])# Get an array of ones and zeros, with ones for nonzeros of x and zeros elsewherexa1 =(xa!=0)+0# Find consecutive differences on xa1xadf = np.diff(xa1)# Find start and stop+1 indices and thus the lengths of "islands" of non-zerosstarts = np.where(xadf==1)[0]stops_p1 = np.where(xadf==-1)[0]lens = stops_p1 - starts# Mark indices where "minus ones" are to be put for applying cumsumput_m1 = stops_p1[[stops_p1 < x.size]]# Setup vector with ones for nonzero x's, "minus lens" at stops +1 & zeros elsewherevec = xa1[1:-1] # Note: this will change xa1, but it's okay as not needed anymorevec[put_m1] = -lens[0:put_m1.size]# Perform cumsum to get the desired outputout = vec.cumsum()
Sample run -
In [116]: xOut[116]: array([ 0. , 2.3, 1.2, 4.1, 0. , 0. , 5.3, 0. , 1.2, 3.1, 0. ])In [117]: outOut[117]: array([0, 1, 2, 3, 0, 0, 1, 0, 1, 2, 0], dtype=int32)
Runtime tests -
Here's some runtimes tests comparing the proposed approach against the other itertools.groupby based approach
-
In [21]: N = 1000000 ...: x = np.random.rand(1,N) ...: x[x>0.5] = 0.0 ...: x = x.ravel() ...: In [19]: %timeit sumrunlen_vectorized(x)10 loops, best of 3: 19.9 ms per loopIn [20]: %timeit sumrunlen_loopy(x)1 loops, best of 3: 2.86 s per loop
You can use itertools.groupby
and np.hstack
:
>>> import numpy as np>>> x = np.array([2.3, 1.2, 4.1 , 0.0, 0.0, 5.3, 0, 1.2, 3.1])>>> from itertools import groupby>>> np.hstack([[i if j!=0 else j for i,j in enumerate(g,1)] for _,g in groupby(x,key=lambda x: x!=0)])array([ 1., 2., 3., 0., 0., 1., 0., 1., 2.])
We can group the array elements based on non-zero elements then use a list comprehension and enumerate to replace the non-zero sub-arrays with those index then flatten the list with np.hstack
.
This sub-problem came up in Kick Start 2021 Round A for me. My solution:
def current_run_len(a): a_ = np.hstack([0, a != 0, 0]) # first in starts and last in stops defined d = np.diff(a_) starts = np.where(d == 1)[0] stops = np.where(d == -1)[0] a_[stops + 1] = -(stops - starts) # +1 for behind-last return a_[1:-1].cumsum()
In fact, the problem also required a version where you count down consecutive sequences. Thus here another version with an optional keyword argument which does the same for rev=False
:
def current_run_len(a, rev=False): a_ = np.hstack([0, a != 0, 0]) # first in starts and last in stops defined d = np.diff(a_) starts = np.where(d == 1)[0] stops = np.where(d == -1)[0] if rev: a_[starts] = -(stops - starts) cs = -a_.cumsum()[:-2] else: a_[stops + 1] = -(stops - starts) # +1 for behind-last cs = a_.cumsum()[1:-1] return cs
Results:
a = np.array([1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1])print('a = ', a)print('current_run_len(a) = ', current_run_len(a))print('current_run_len(a, rev=True) = ', current_run_len(a, rev=True))
a = [1 1 1 1 0 0 0 1 1 0 1 0 0 0 1]current_run_len(a) = [1 2 3 4 0 0 0 1 2 0 1 0 0 0 1]current_run_len(a, rev=True) = [4 3 2 1 0 0 0 2 1 0 1 0 0 0 1]
For an array that consists of 0s and 1s only, you can simplify [0, a != 0, 0]
to [0, a, 0]
. But the version as-posted also works for arbitrary non-zero numbers.