
Memory consumption of NumPy function for standard deviation


I doubt you will find any such functions in numpy. The raison d'être of numpy is that it takes advantage of vector processor instruction sets -- performing the same instruction on large amounts of data at once. Essentially, numpy trades memory efficiency for speed. However, because plain Python objects are so memory intensive, numpy still achieves some memory savings by associating the data type with the array as a whole rather than with each individual element.
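To make that per-element overhead concrete, here is a minimal sketch (the exact byte counts depend on your platform and Python build) comparing a Python list of floats with a float64 array:

import sys
import numpy as np

n = 1_000_000
py_list = [float(i) for i in range(n)]    # one full float object per element
np_arr = np.arange(n, dtype=np.float64)   # one dtype for the whole raw buffer

# the list needs its pointer array plus a full float object per element;
# the ndarray needs essentially 8 bytes per element
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print("list   :", list_bytes, "bytes")
print("ndarray:", np_arr.nbytes, "bytes")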

One way to improve the speed, while still paying some memory overhead, is to calculate the standard deviation in chunks, e.g.:

import numpy as np

def std(arr, blocksize=1000000):
    """Written for py3, change range to xrange for py2.
    This implementation requires the entire array in memory, but it shows how you
    can calculate the standard deviation in a piecemeal way.
    """
    num_blocks, remainder = divmod(len(arr), blocksize)
    mean = arr.mean()
    tmp = np.empty(blocksize, dtype=float)
    total_squares = 0
    for start in range(0, blocksize*num_blocks, blocksize):
        # get a view of the data we want -- views do not "own" the data they
        # point to, so they have minimal memory overhead
        view = arr[start:start+blocksize]
        # in-place operations prevent a new array from being created
        np.subtract(view, mean, out=tmp)
        tmp *= tmp
        total_squares += tmp.sum()
    if remainder:
        # len(arr) % blocksize != 0, so process the last part of the array
        # using views of the tail of arr and tmp -- again, minimal new allocation
        view = arr[-remainder:]
        tmp = tmp[-remainder:]
        np.subtract(view, mean, out=tmp)
        tmp *= tmp
        total_squares += tmp.sum()
    var = total_squares / len(arr)
    sd = var ** 0.5
    return sd

a = np.arange(20e6)
assert np.isclose(np.std(a), std(a))
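As the docstring says, this version still keeps the whole array in memory. If the data lives on disk as raw float64 values, the same function can be pointed at a numpy memmap so that only the block being processed is paged in at a time; a sketch, where the file name and dtype are assumptions for illustration:

import numpy as np

# hypothetical file of raw float64 values -- adjust path, dtype and shape to your data
big = np.memmap("data.f64", dtype=np.float64, mode="r")

# std() only slices and reduces its argument, so it works unchanged on the
# memory-mapped array; each arr[start:start+blocksize] view only touches the
# pages belonging to that block
print(std(big, blocksize=1000000))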

The timings below show the speed-up -- the larger the blocksize, the larger the speed-up -- together with a considerably lower memory overhead (although the memory measurement is not 100% accurate).

In [70]: %timeit np.std(a)
10 loops, best of 3: 105 ms per loop

In [71]: %timeit std(a, blocksize=4096)
10 loops, best of 3: 160 ms per loop

In [72]: %timeit std(a, blocksize=1000000)
10 loops, best of 3: 105 ms per loop

In [73]: %memit std(a, blocksize=4096)
peak memory: 360.11 MiB, increment: 0.00 MiB

In [74]: %memit std(a, blocksize=1000000)
peak memory: 360.11 MiB, increment: 0.00 MiB

In [75]: %memit np.std(a)
peak memory: 512.70 MiB, increment: 152.59 MiB
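For reproducibility: the %timeit magic is built into IPython, while %memit comes from the memory_profiler package and has to be loaded first, e.g.:

# in IPython/Jupyter, assuming memory_profiler is installed (pip install memory_profiler)
%load_ext memory_profiler
%memit np.std(a)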


Cython to the rescue! This achieves a nice speed up:

%%cython
# run in IPython/Jupyter with the Cython extension loaded (%load_ext Cython)
cimport cython
cimport numpy as np
import numpy as np   # the cimport only pulls in C declarations; np.nan needs the Python module
from libc.math cimport sqrt

@cython.boundscheck(False)
def std_welford(np.ndarray[np.float64_t, ndim=1] a):
    # single-pass Welford algorithm; note that cdef float is single precision,
    # which is why the result differs slightly from np.std
    cdef int n = 0
    cdef float mean = 0
    cdef float M2 = 0
    cdef int a_len = len(a)
    cdef int i
    cdef float delta
    cdef float result
    for i in range(a_len):
        n += 1
        delta = a[i] - mean
        mean += delta / n
        M2 += delta * (a[i] - mean)
    if n < 2:
        result = np.nan
        return result
    else:
        result = sqrt(M2 / (n - 1))
        return result
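For reference, the same Welford recurrence written in plain Python (presumably what "Original code" refers to in the timings further down, though that is an assumption) looks like this -- the per-element interpreter overhead is exactly what the Cython version compiles away:

import math

def std_welford_py(a):
    # same single-pass Welford update as the Cython function above,
    # but every iteration runs through the Python interpreter
    n = 0
    mean = 0.0
    M2 = 0.0
    for x in a:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    if n < 2:
        return float("nan")
    return math.sqrt(M2 / (n - 1))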

Using the following to test the Cython version:

a = np.random.rand(10000).astype(np.float64)
print(std_welford(a))
%timeit -n 10 -r 10 std_welford(a)

Cython code

0.288327455521
10 loops, best of 10: 59.6 µs per loop

Original code

0.289605617397
10 loops, best of 10: 18.5 ms per loop

Numpy std

0.289493223504
10 loops, best of 10: 29.3 µs per loop

So that is a speed increase of around 300x over the original Python code, though it is still not quite as fast as the numpy version.