Fastest way to grow a numpy numeric array


I tried a few different things, with timing.

    import numpy as np

  1. The method you mention as slow: (32.094 seconds)

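    class A:
        def __init__(self):
            self.data = np.array([])

        def update(self, row):
            # np.append allocates a new array and copies everything on each call
            self.data = np.append(self.data, row)

        def finalize(self):
            # integer division so the row count is an int
            return np.reshape(self.data, (self.data.shape[0] // 5, 5))
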
  2. Regular ol Python list: (0.308 seconds)

    class B:
        def __init__(self):
            self.data = []

        def update(self, row):
            for r in row:
                self.data.append(r)

        def finalize(self):
            # the list is converted to a new array on every call
            return np.reshape(self.data, (len(self.data) // 5, 5))

  3. Trying to implement an arraylist in numpy: (0.362 seconds)

    class C:
        def __init__(self):
            self.data = np.zeros((100,))
            self.capacity = 100
            self.size = 0

        def update(self, row):
            for r in row:
                self.add(r)

        def add(self, x):
            if self.size == self.capacity:
                # buffer is full: allocate 4x the space and copy the old data over
                self.capacity *= 4
                newdata = np.zeros((self.capacity,))
                newdata[:self.size] = self.data
                self.data = newdata
            self.data[self.size] = x
            self.size += 1

        def finalize(self):
            # slice off the unused tail of the buffer
            data = self.data[:self.size]
            return np.reshape(data, (len(data) // 5, 5))

And this is how I timed it:

    x = C()
    for i in range(100000):
        x.update([i])

So it looks like regular old Python lists are pretty good ;)
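For anyone who wants to rerun the comparison, here is a minimal sketch using timeit. It is not the exact code timed above: the names ListBuffer, GrowableBuffer and fill are made up for this example, and the buffer variant copies a whole row at once instead of one element at a time. Absolute numbers will depend on your machine and NumPy version.

```python
import timeit

import numpy as np


class ListBuffer:
    """Approach 2: accumulate in a Python list, convert once at the end."""

    def __init__(self):
        self.data = []

    def update(self, row):
        self.data.extend(row)

    def finalize(self):
        return np.asarray(self.data, dtype=float).reshape(-1, 5)


class GrowableBuffer:
    """Approach 3: preallocated NumPy buffer that grows by 4x when full."""

    def __init__(self, capacity=100):
        self.data = np.zeros(capacity)
        self.size = 0

    def update(self, row):
        needed = self.size + len(row)
        if needed > self.data.shape[0]:
            # Grow geometrically so the copy cost is amortized over many updates.
            new_capacity = self.data.shape[0]
            while new_capacity < needed:
                new_capacity *= 4
            new_data = np.zeros(new_capacity)
            new_data[:self.size] = self.data[:self.size]
            self.data = new_data
        self.data[self.size:needed] = row
        self.size = needed

    def finalize(self):
        # Slicing and reshaping a contiguous buffer does not copy the data.
        return self.data[:self.size].reshape(-1, 5)


def fill(cls, n=100_000):
    """Push n single-element rows into a fresh buffer and finalize it."""
    buf = cls()
    for i in range(n):
        buf.update([i])
    return buf.finalize()


if __name__ == "__main__":
    for cls in (ListBuffer, GrowableBuffer):
        seconds = timeit.timeit(lambda: fill(cls), number=3) / 3
        print(f"{cls.__name__}: {seconds:.3f} s per run")
```

Both classes produce the same array, so the only thing being compared is how the data is accumulated.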


np.append() copies all the data in the array on every call, whereas a list over-allocates: when it fills up, it grows its capacity by a factor (about 1.125), so most appends need no copy. A list is fast, but it uses more memory than an array. If memory matters, you can use the array module from the Python standard library.
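As a rough illustration of the array module suggestion, the sketch below appends floats to an array.array and then hands the memory to NumPy. The variable names are just for this example, and it assumes the values are doubles (typecode "d", matching np.float64).

```python
from array import array

import numpy as np

buf = array("d")           # typed storage: one C double per element
for i in range(20):
    buf.append(i)          # amortized O(1), like list.append

# np.frombuffer wraps the array's memory without copying it.
# The trailing .copy() gives NumPy its own data, so buf can keep growing afterwards.
result = np.frombuffer(buf, dtype=np.float64).reshape(-1, 5).copy()

print(result.shape)
```

Without the .copy(), the NumPy array is only a view on buf, and appending to buf while that view is alive raises a BufferError.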

Here is a discussion about this topic:

How to create a dynamic array


Using the class declarations from Owen's post, here are revised timings that also account for the cost of finalize.

In short, I find class C to provide an implementation that is over 60x faster than the method in the original post. (apologies for the wall of text)

The file I used:

    #!/usr/bin/python
    import cProfile
    import numpy as np

    # ... class declarations here ...

    def test_class(f):
        x = f()
        # 100000 single-element updates, then 1000 finalize calls
        for i in range(100000):
            x.update([i])
        for i in range(1000):
            x.finalize()

    for x in 'ABC':
        cProfile.run('test_class(%s)' % x)

Now, the resulting timings:

A:

    903005 function calls in 16.049 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000   16.049   16.049 <string>:1(<module>)
    100000    0.139    0.000    1.888    0.000 fromnumeric.py:1043(ravel)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
    100000    0.322    0.000   14.424    0.000 function_base.py:3466(append)
    100000    0.102    0.000    1.623    0.000 numeric.py:216(asarray)
    100000    0.121    0.000    0.298    0.000 numeric.py:286(asanyarray)
      1000    0.002    0.000    0.004    0.000 test.py:12(finalize)
         1    0.146    0.146   16.049   16.049 test.py:50(test_class)
         1    0.000    0.000    0.000    0.000 test.py:6(__init__)
    100000    1.475    0.000   15.899    0.000 test.py:9(update)
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    100000    0.126    0.000    0.126    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
    200001    1.698    0.000    1.698    0.000 {numpy.core.multiarray.array}
    100000   11.915    0.000   11.915    0.000 {numpy.core.multiarray.concatenate}

B:

    208004 function calls in 16.885 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.001    0.001   16.885   16.885 <string>:1(<module>)
      1000    0.025    0.000   16.508    0.017 fromnumeric.py:107(reshape)
      1000    0.013    0.000   16.483    0.016 fromnumeric.py:32(_wrapit)
      1000    0.007    0.000   16.445    0.016 numeric.py:216(asarray)
         1    0.000    0.000    0.000    0.000 test.py:16(__init__)
    100000    0.068    0.000    0.080    0.000 test.py:19(update)
      1000    0.012    0.000   16.520    0.017 test.py:23(finalize)
         1    0.284    0.284   16.883   16.883 test.py:50(test_class)
      1000    0.005    0.000    0.005    0.000 {getattr}
      1000    0.001    0.000    0.001    0.000 {len}
    100000    0.012    0.000    0.012    0.000 {method 'append' of 'list' objects}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.020    0.000    0.020    0.000 {method 'reshape' of 'numpy.ndarray' objects}
      1000   16.438    0.016   16.438    0.016 {numpy.core.multiarray.array}

C:

    204010 function calls in 0.244 seconds

    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.000    0.000    0.244    0.244 <string>:1(<module>)
      1000    0.001    0.000    0.003    0.000 fromnumeric.py:107(reshape)
         1    0.000    0.000    0.000    0.000 test.py:27(__init__)
    100000    0.082    0.000    0.170    0.000 test.py:32(update)
    100000    0.087    0.000    0.088    0.000 test.py:36(add)
      1000    0.002    0.000    0.005    0.000 test.py:46(finalize)
         1    0.068    0.068    0.243    0.243 test.py:50(test_class)
      1000    0.000    0.000    0.000    0.000 {len}
         1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
      1000    0.002    0.000    0.002    0.000 {method 'reshape' of 'numpy.ndarray' objects}
         6    0.001    0.000    0.001    0.000 {numpy.core.multiarray.zeros}

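Class A is destroyed by the updates (almost all of its time goes to concatenate inside np.append), class B is destroyed by the finalizes (each one converts the whole list to a new array). Class C is robust in the face of both of them.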
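A small sketch of why the finalize step differs so much between B and C: reshaping a Python list has to build a fresh array from every element, while slicing and reshaping an existing NumPy buffer only creates a view. The variable names here are made up for the illustration and are not part of the profiled script.

```python
import numpy as np

size = 100_000
values = list(range(size))       # what class B holds
buffer = np.zeros(4 * size)      # what class C holds (with spare capacity)
buffer[:size] = values

# B-style finalize: the list is converted element by element into a new array.
from_list = np.reshape(values, (-1, 5))

# C-style finalize: a slice of the buffer reshaped in place, no data copied.
from_buffer = np.reshape(buffer[:size], (-1, 5))

print(np.shares_memory(from_buffer, buffer))
```

The flip side is that the array returned by the C-style finalize aliases the internal buffer, so writing to it changes the buffer until the next reallocation.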