
NumPy grouping using itertools.groupby performance


I get roughly a threefold improvement doing something like this:

def group():
    import numpy as np
    values = np.array(np.random.randint(0, 3298, size=35000000), dtype='u4')
    values.sort()
    dif = np.ones(values.shape, values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif>0)
    vals = values[idx]
    count = np.diff(idx)


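For a quick sanity check of this sort-and-diff grouping on a smaller array, something like the sketch below can be used. It assumes a newer NumPy, since np.unique only gained return_counts in 1.9, and it appends len(values) to idx so the final group's count is included as well:

import numpy as np

values = np.random.randint(0, 3298, size=100000).astype('u4')
values.sort()
diff = np.concatenate(([1], np.diff(values)))
idx = np.concatenate((np.where(diff)[0], [len(values)]))   # group start positions plus an end sentinel
vals = values[idx[:-1]]                                    # one representative value per group
count = np.diff(idx)                                       # size of each group

# cross-check against NumPy's built-in grouping
ref_vals, ref_count = np.unique(values, return_counts=True)
assert np.array_equal(vals, ref_vals) and np.array_equal(count, ref_count)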
More than 5 years have passed since Paul's answer was accepted. Interestingly, the sort() is still the bottleneck in the accepted solution.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def group_paul():
     5         1        99040  99040.0      2.4      import numpy as np
     6         1       305651 305651.0      7.4      values = np.array(np.random.randint(0, 2**32,size=35000000),dtype='u4')
     7         1      2928204 2928204.0    71.3      values.sort()
     8         1        78268  78268.0      1.9      diff = np.concatenate(([1],np.diff(values)))
     9         1       215774 215774.0      5.3      idx = np.concatenate((np.where(diff)[0],[len(values)]))
    10         1           95     95.0      0.0      index = np.empty(len(idx)-1,dtype='u4,u2')
    11         1       386673 386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492  91492.0      2.2      index['f1'] = np.diff(idx)

The accepted solution runs for 4.0 s on my machine; with radix sort it drops down to 1.7 s.

Just by switching to radix sort, I get an overall 2.35x speedup. The radix sort is more than 4x faster than quicksort in this case.

See How to sort an array of integers faster than quicksort?, which was motivated by your question.
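For illustration only, here is a rough pure-NumPy sketch of the LSD radix-sort idea: stable counting-sort passes over the four bytes, least significant first. The speedup quoted above comes from a compiled radix sort like the one in the linked answer; a Python-level version such as this only shows the algorithm, it is not fast:

import numpy as np

def radix_sort_u4(values, bits=8):
    """LSD radix sort sketch for uint32 keys (illustrative, not tuned)."""
    out = np.asarray(values, dtype=np.uint32).copy()
    nbuckets = 1 << bits
    mask = nbuckets - 1
    for shift in range(0, 32, bits):
        digit = (out >> shift) & mask                    # current byte of every key
        counts = np.bincount(digit, minlength=nbuckets)
        starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
        nxt = np.empty_like(out)
        for d in range(nbuckets):                        # scatter bucket by bucket
            nxt[starts[d]:starts[d] + counts[d]] = out[digit == d]   # keeps original order, so the pass is stable
        out = nxt
    return out

# sanity check on a small array
a = np.random.randint(0, 2**32, size=10000, dtype=np.uint32)
assert np.array_equal(radix_sort_u4(a), np.sort(a))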


For the profiling I used line_profiler and kernprof (the @profile decorator comes from there).
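In case it helps, this is roughly how such a run is set up with kernprof (the file name below is made up; the function body is the same one shown in the profile above):

# profile_group.py -- hypothetical file name for this profiling run
import numpy as np

try:
    profile                      # injected into builtins by `kernprof -l`
except NameError:                # fall back to a no-op so the script also runs bare
    def profile(func):
        return func

@profile
def group_paul():
    values = np.array(np.random.randint(0, 2**32, size=35000000), dtype='u4')
    values.sort()
    diff = np.concatenate(([1], np.diff(values)))
    idx = np.concatenate((np.where(diff)[0], [len(values)]))
    index = np.empty(len(idx) - 1, dtype='u4,u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)

if __name__ == '__main__':
    group_paul()

# kernprof -l -v profile_group.py                  # run and print per-line timings
# python -m line_profiler profile_group.py.lprof   # or inspect the saved stats later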


By request, here is a Cython version of this. I did two passes through the array. The first one finds out how many unique elements there are, so that I can make the arrays for the unique values and the counts the appropriate size.

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def dogroup():
    cdef unsigned long tot = 1
    cdef np.ndarray[np.uint32_t, ndim=1] values = np.array(np.random.randint(35000000,size=35000000),dtype=np.uint32)
    cdef unsigned long i, ind, lastval
    values.sort()
    # first pass: count how many unique values there are
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            tot += 1
    cdef np.ndarray[np.uint32_t, ndim=1] vals = np.empty(tot,dtype=np.uint32)
    cdef np.ndarray[np.uint32_t, ndim=1] count = np.empty(tot,dtype=np.uint32)
    # second pass: fill in the unique values and the length of each run
    vals[0] = values[0]
    ind = 1
    lastval = 0
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            vals[ind] = values[i]
            count[ind-1] = i - lastval
            lastval = i
            ind += 1
    count[ind-1] = len(values) - lastval

The sorting is actually taking the most time here by far. Using the values array given in my code, the sorting takes 4.75 seconds and the actual finding of the unique values and counts takes 0.67 seconds. With Paul's pure NumPy code (using the same form of the values array and the fix I suggested in a comment), finding the unique values and counts takes 1.9 seconds (sorting still takes the same amount of time, of course).

It makes sense for most of the time to be taken up by the sorting, because the sort is O(N log N) while the counting is only O(N). You can speed up the sort a little over NumPy's (its quicksort is implemented in C), but you have to really know what you are doing and it probably isn't worthwhile. The same goes for squeezing a little more speed out of my Cython code.
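To reproduce the sort-versus-count split on your own machine, a simple timing sketch like this works (absolute numbers will of course differ from the ones quoted above):

import time
import numpy as np

values = np.random.randint(0, 35000000, size=35000000).astype(np.uint32)

t0 = time.perf_counter()
values.sort()                                        # the O(N log N) part
t1 = time.perf_counter()

diff = np.concatenate(([1], np.diff(values)))
idx = np.concatenate((np.where(diff)[0], [len(values)]))
vals = values[idx[:-1]]
count = np.diff(idx)                                 # the O(N) grouping/counting part
t2 = time.perf_counter()

print("sort: %.2f s   group/count: %.2f s" % (t1 - t0, t2 - t1))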