
Split numpy array into chunks by maximum size


This can be done so that the resulting arrays have shapes slightly less than the desired maximum, or so that they have exactly the desired maximum except for some remainder at the end.

The basic logic is to compute the parameters for splitting the array and then use array_split to split the array along each axis (or dimension) of the array.

We'll need the numpy and math modules and the example array:

import math
import numpy

a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))

Slightly less than max

The logic

First, store the desired chunk shape (the maximum size along each dimension) in a tuple:

chunk_shape = (50, 50)

array_split only splits along one axis (or dimension) of an array at a time. So let's start with just the first axis.

  1. Compute the number of sections we need to split the array into:

    num_sections = math.ceil(a.shape[0] / chunk_shape[0])

    In our example case, this is 21 (1009 / 50 = 20.18).

  2. Now split it:

    first_split = numpy.array_split(a, num_sections, axis=0)

    This gives us a list of 21 (the number of requested sections) numpy arrays that are split so they are no larger than 50 in the first dimension:

    print(len(first_split))
    # 21
    print({i.shape for i in first_split})
    # {(48, 1009), (49, 1009)}
    # These are the distinct shapes, so we don't see all 21 separately

    In this case, they're 48 and 49 along that axis.

  3. We can do the same thing to each new array for the second dimension:

    num_sections = math.ceil(a.shape[1] / chunk_shape[1])
    second_split = [numpy.array_split(a2, num_sections, axis=1) for a2 in first_split]

    This gives us a list of lists. Each sublist contains numpy arrays of the size we wanted:

    print(len(second_split))
    # 21
    print({len(i) for i in second_split})
    # {21}
    # All sublists are 21 long
    print({i2.shape for i in second_split for i2 in i})
    # {(48, 49), (49, 48), (48, 48), (49, 49)}
    # Distinct shapes

The full function

We can implement this for arbitrary dimensions using a recursive function:

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

We call it like so:

full_split = split_to_approx_shape(a, (50, 50))
print({i2.shape for i in full_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes

Exact shapes plus remainder

The logic

If we want to be a little fancier and have all the new arrays be exactly the specified size except for a trailing leftover array, we can do that by passing array_split a list of indices to split at.

  1. First build up the array of indices:

    axis = 0
    split_indices = [chunk_shape[axis] * (i + 1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]

This gives us a list of indices, each 50 greater than the last:

    print(split_indices)# [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
  2. Then split:

    first_split = numpy.array_split(a, split_indices, axis=0)
    print(len(first_split))
    # 21
    print({i.shape for i in first_split})
    # {(9, 1009), (50, 1009)}
    # Distinct shapes, so we don't see all 21 separately
    print((first_split[0].shape, first_split[1].shape, '...', first_split[-2].shape, first_split[-1].shape))
    # ((50, 1009), (50, 1009), '...', (50, 1009), (9, 1009))
  3. And then again for the second axis:

    axis = 1
    split_indices = [chunk_shape[axis] * (i + 1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]
    second_split = [numpy.array_split(a2, split_indices, axis=1) for a2 in first_split]
    print({i2.shape for i in second_split for i2 in i})
    # {(9, 50), (9, 9), (50, 9), (50, 50)}

The full function

Adapting the recursive function:

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    split_indices = [
        chunk_shape[start_axis] * (i + 1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

And we call it exactly the same way:

full_split = split_to_shape(a, (50, 50))
print({i2.shape for i in full_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}
# Distinct shapes

Extra Notes

Performance

These functions seem to be quite fast. I was able to split up my example array (with nearly 14 billion elements) into 1000 by 1000 shaped pieces (resulting in nearly 14,000 new arrays) in under 0.05 seconds with either function:

import timeit

print('Building test array')
a = numpy.random.randint(4, size=(55000, 250000), dtype='uint8')
chunks = (1000, 1000)
numtests = 1000
print('Running {} tests'.format(numtests))
print('split_to_approx_shape: {} seconds'.format(
    timeit.timeit(lambda: split_to_approx_shape(a, chunks), number=numtests) / numtests))
print('split_to_shape: {} seconds'.format(
    timeit.timeit(lambda: split_to_shape(a, chunks), number=numtests) / numtests))

Output:

Building test array
Running 1000 tests
split_to_approx_shape: 0.035109398348040485 seconds
split_to_shape: 0.03113800323300747 seconds

I did not test speed with higher dimension arrays.

Shapes smaller than the max

These functions both work properly if the size of any dimension is less than the specified maximum. This requires no special logic.
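For example, a quick illustrative check with an array whose first axis is already smaller than the chunk size, using the split_to_shape function from above:

small = numpy.ones((30, 120))
full_split = split_to_shape(small, (50, 50))
print({i2.shape for i in full_split for i2 in i})
# {(30, 50), (30, 20)}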


As I don't know how your data is generated or will be processed, I can suggest two approaches:

Enable the use of numpy's reshape

Pad the array to allow reshaping it into your chunk dimensions. Simply pad with zeros, such that each (axis_size % chunk_size) == 0. The chunk_size could be different for each axis.

Padding a multidimensional array like this creates a (slightly bigger) copy. To avoid the copy, 'cut out' the biggest chunkable sub-array, reshape it, and process the leftover borders separately.

Depending on how your data needs to be processed, this could be very impractical.
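For concreteness, here is a minimal sketch of the padding-plus-reshape idea; the array contents and chunk shape are just placeholders:

import numpy as np

a = np.arange(1009 * 1009).reshape(1009, 1009)
chunk_shape = (50, 50)

# Pad each axis up to the next multiple of its chunk size.
pad_widths = [(0, (-size) % chunk) for size, chunk in zip(a.shape, chunk_shape)]
padded = np.pad(a, pad_widths, mode='constant', constant_values=0)
# padded.shape == (1050, 1050)

# Reshape into blocks: (blocks_y, chunk_y, blocks_x, chunk_x),
# then swap the middle axes so chunks are indexed as blocks[i, j].
blocks = padded.reshape(
    padded.shape[0] // chunk_shape[0], chunk_shape[0],
    padded.shape[1] // chunk_shape[1], chunk_shape[1],
).swapaxes(1, 2)
print(blocks.shape)
# (21, 21, 50, 50)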

Use a list comp

I think there are simpler / more readable versions of the split implementation, either using numpy.split() or just fancy indexing.

import numpy as np

a = np.arange(1009)
chunk_size = 50

%timeit np.split(a, range(chunk_size, a.shape[0], chunk_size))
%timeit [a[i:i + chunk_size] for i in range(0, a.shape[0], chunk_size)]

shows that the list comp is ~3x as fast while returning the same result:

36.8 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10.4 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I guess the speedup of the list comprehension should translate directly to higher dimensional arrays. numpy's implementation of array_split does basically that, but additionally allows chunking along arbitrary axes. However, the list comp could be extended to do that too, as sketched below.
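For example, a two-dimensional version of the list comprehension might look like this (an illustrative sketch, not benchmarked):

import numpy as np

a = np.arange(1009 * 1009).reshape(1009, 1009)
chunk_size = 50

# Slice rows in the outer loop and columns in the inner loop.
chunks = [
    [a[i:i + chunk_size, j:j + chunk_size]
     for j in range(0, a.shape[1], chunk_size)]
    for i in range(0, a.shape[0], chunk_size)
]
print(chunks[0][0].shape, chunks[-1][-1].shape)
# (50, 50) (9, 9)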


By simply using np.array_split and ceiling division we can do this relatively easily.

import numpy as np

max_size = 15
test = np.arange(101)

# Ceiling division gives the number of sections needed.
result = np.array_split(test, (len(test) + (max_size - 1)) // max_size)
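Since this still uses array_split with a computed number of sections, the chunks come out at or just below max_size; checking the lengths for this example:

print([len(c) for c in result])
# [15, 15, 15, 14, 14, 14, 14]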