
Using strides for an efficient moving average filter


For what it's worth, here's how you'd do it using "fancy" striding tricks. I was going to post this yesterday, but got distracted by actual work! :)

@Paul & @eat both have nice implementations using various other ways of doing this. Just to continue things from the earlier question, I figured I'd post the N-dimensional equivalent.

You're not going to be able to significantly beat scipy.ndimage functions for >1D arrays, however. (scipy.ndimage.uniform_filter should beat scipy.ndimage.convolve, though)

Moreover, if you're trying to get a multidimensional moving window, you risk having memory usage blow up whenever you inadvertently make a copy of your array. While the initial "rolling" array is just a view into the memory of your original array, any intermediate steps that copy the array will make a copy that is orders of magnitude larger than your original array. For example, say you're working with a 100x100 original array: the view into it (for a filter size of (3,3)) will be 98x98x3x3 but use the same memory as the original. Any copies, however, will use the full amount of memory that a 98x98x3x3 array requires!

Basically, using crazy striding tricks is great when you want to vectorize moving window operations on a single axis of an ndarray. It makes it really easy to calculate things like a moving standard deviation, etc. with very little overhead. When you want to start doing this along multiple axes, it's possible, but you're usually better off with more specialized functions (such as scipy.ndimage, etc.).

At any rate, here's how you do it:

import numpy as np

def rolling_window_lastaxis(a, window):
    """Directly taken from Erik Rigtorp's post to numpy-discussion.
    <http://www.mail-archive.com/numpy-discussion@scipy.org/msg29450.html>"""
    if window < 1:
        raise ValueError("`window` must be at least 1.")
    if window > a.shape[-1]:
        raise ValueError("`window` is too long.")
    # Add a trailing `window`-sized axis; no data is copied, only strides change.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def rolling_window(a, window):
    if not hasattr(window, '__iter__'):
        return rolling_window_lastaxis(a, window)
    # Roll a window over each requested axis in turn.
    for i, win in enumerate(window):
        if win > 1:
            a = a.swapaxes(i, -1)
            a = rolling_window_lastaxis(a, win)
            a = a.swapaxes(-2, i)
    return a

filtsize = (3, 3)
a = np.zeros((10, 10), dtype=float)
a[5:7, 5] = 1
b = rolling_window(a, filtsize)
blurred = b.mean(axis=-1).mean(axis=-1)

So what we get when we do b = rolling_window(a, filtsize) is an 8x8x3x3 array that's actually a view into the same memory as the original 10x10 array. We could just as easily have used different filter sizes along different axes or operated only along selected axes of an N-dimensional array (i.e. filtsize = (0,3,0,3) on a 4-dimensional array would give us a 6 dimensional view).
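To make the memory point above concrete, here's a quick sketch (using the rolling_window function from the listing above; np.shares_memory needs a reasonably recent numpy):

# Sketch: the rolling-window result is only a view; copies are where memory blows up.
import numpy as np

a = np.zeros((10, 10), dtype=float)
b = rolling_window(a, (3, 3))

print(b.shape)                    # (8, 8, 3, 3)
print(np.shares_memory(a, b))     # True -- no new data was allocated for the view
print(a.nbytes, b.copy().nbytes)  # 800 vs 4608 bytes -- a copy really is ~9x bigger

# The N-dimensional case mentioned above: only axes with a window > 1 get "rolled".
c = rolling_window(np.zeros((5, 6, 7, 8)), (0, 3, 0, 3))
print(c.shape)                    # (5, 4, 7, 6, 3, 3) -- a 6-dimensional view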

We can then apply an arbitrary function to the last axis repeatedly to effectively calculate things in a moving window.
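For instance, a moving standard deviation (or maximum, or whatever) over the same 3x3 windows is just a reduction over the last two axes; a minimal sketch, assuming a numpy recent enough to accept a tuple of axes in reductions:

# Sketch: arbitrary reductions over the moving windows (b from the listing above).
moving_std = b.std(axis=(-1, -2))    # 8x8 moving standard deviation
moving_max = b.max(axis=(-1, -2))    # 8x8 moving maximum
moving_avg = b.mean(axis=(-1, -2))   # same result as the two chained means above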

However, because we're storing temporary arrays that are much bigger than our original array on each step of mean (or std or whatever), this is not at all memory efficient! It's also not going to be terribly fast, either.

The equivalent for ndimage is just:

blurred = scipy.ndimage.uniform_filter(a, filtsize, output=a)

This will handle a variety of boundary conditions, do the "blurring" in-place without requiring a temporary copy of the array, and be very fast. Striding tricks are a good way to apply a function to a moving window along one axis, but they're not a good way to do it along multiple axes, usually....
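For completeness, a minimal self-contained sketch of the ndimage version (mode controls the boundary condition, with 'reflect' being scipy's default; passing output=a does the work in place):

# Sketch: the same blur via scipy.ndimage, with boundary handling and no big temporaries.
import numpy as np
import scipy.ndimage

a = np.zeros((10, 10), dtype=float)
a[5:7, 5] = 1

scipy.ndimage.uniform_filter(a, size=3, mode='reflect', output=a)  # a now holds the blurred result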

Just my $0.02, at any rate...


I'm not familiar enough with Python to write out code for that, but the two best ways to speed up convolutions are to either separate the filter or to use the Fourier transform.

Separable filter: Convolution is O(M*N), where M and N are the number of pixels in the image and the filter, respectively. Since average filtering with a 3-by-3 kernel is equivalent to filtering first with a 3-by-1 kernel and then a 1-by-3 kernel, the per-pixel cost drops from 3*3 = 9 to 3+3 = 6 operations, i.e. roughly a 30% speed improvement from consecutive convolution with two 1-d kernels (this obviously gets better as the kernel gets larger). You may still be able to use stride tricks here, of course.
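A sketch of the separable approach using scipy.ndimage's 1-d convolution (the array here is just a placeholder, and the boundary mode is left at scipy's default):

# Sketch: a 3x3 box blur as two 1-D convolutions instead of one 2-D convolution.
import numpy as np
import scipy.ndimage

a = np.random.rand(100, 100)
k1d = np.ones(3) / 3.0                             # 1-D box kernel

tmp = scipy.ndimage.convolve1d(a, k1d, axis=0)     # filter down the columns...
out = scipy.ndimage.convolve1d(tmp, k1d, axis=1)   # ...then across the rows

# Equivalent (up to floating-point rounding) to convolving with the full 3x3 kernel:
out_2d = scipy.ndimage.convolve(a, np.ones((3, 3)) / 9.0)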

Fourier Transform : conv(A,B) is equivalent to ifft(fft(A)*fft(B)), i.e. a convolution in direct space becomes a multiplication in Fourier space, where A is your image and B is your filter. Since the (element-wise) multiplication of the Fourier transforms requires that A and B are the same size, B is an array of size(A) with your kernel at the very center of the image and zeros everywhere else. To place a 3-by-3 kernel at the center of an array, you may have to pad A to odd size. Depending on your implementation of the Fourier transform, this can be a lot faster than the convolution (and if you apply the same filter multiple times, you can pre-compute fft(B), saving another 30% of computation time).
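A sketch of the FFT route in plain numpy. Note that straight fft-based convolution is circular (it wraps around at the edges), so in practice you'd pad the image or reach for something like scipy.signal.fftconvolve; the sizes below are only for illustration:

# Sketch: box blur via the Fourier transform (with wrap-around boundary conditions).
import numpy as np

a = np.random.rand(101, 101)                 # odd-sized image, so the kernel has a true center

kernel = np.zeros_like(a)                    # B: same size as A, kernel at the very center
ci, cj = a.shape[0] // 2, a.shape[1] // 2
kernel[ci - 1:ci + 2, cj - 1:cj + 2] = 1.0 / 9.0   # 3-by-3 averaging kernel

kernel = np.fft.ifftshift(kernel)            # move the kernel's center to index (0, 0)
blurred = np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(kernel)))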


Let's see:

It's not so clear from your question, but I'm assuming now that you'd like to significantly improve this kind of averaging.

import numpy as np
from numpy.lib import stride_tricks as st

def mf(A, k_shape=(3, 3)):
    m = A.shape[0] - 2
    n = A.shape[1] - 2
    # 4D view: a (3, 3) window at every interior grid position, no data copied.
    strides = A.strides + A.strides
    new_shape = (m, n, k_shape[0], k_shape[1])
    A = st.as_strided(A, shape=new_shape, strides=strides)
    return np.sum(np.sum(A, -1), -1) / np.prod(k_shape)

if __name__ == '__main__':
    A = np.arange(100).reshape((10, 10))
    print(mf(A))

Now, what kind of performance improvement would you actually expect?
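If you want to put numbers on that, here is a minimal timing sketch (no results claimed; the array size is just a placeholder, mf from above is assumed to be defined in the session, and the outcome will depend on your machine):

# Sketch: time mf against scipy.ndimage.uniform_filter on your own hardware.
import timeit
import numpy as np
import scipy.ndimage

A = np.random.rand(1000, 1000)
print(timeit.timeit(lambda: mf(A), number=10))                               # strided version above
print(timeit.timeit(lambda: scipy.ndimage.uniform_filter(A, 3), number=10))  # ndimage baseline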

Update:
First of all, a warning: the code in its current state does not adapt properly to the 'kernel' shape. However, that's not my primary concern right now (the idea of how to adapt it properly is already there, anyway).

I just chose the new shape of the 4D A intuitively; to me it really makes sense to think of a 2D 'kernel' center being centered at each grid position of the original 2D A.

But that 4D shape may not actually be the 'best' one. I think the real problem here is the performance of the summing. One should be able to find the 'best order' (of the 4D A) in order to fully utilize your machine's cache architecture. However, that order may not be the same for 'small' arrays, which kind of 'co-operate' with your machine's cache, and for larger ones, which don't (at least not in such a straightforward manner).

Update 2:
Here is a slightly modified version of mf. Clearly it's better to reshape to a 3D array first and then, instead of summing, just do a dot product (this also has the advantage that the kernel can be arbitrary). However, it's still some 3x slower (on my machine) than Paul's updated function.

def mf(A):
    k_shape = (3, 3)
    k = np.prod(k_shape)
    m = A.shape[0] - 2
    n = A.shape[1] - 2
    strides = A.strides * 2
    new_shape = (m, n) + k_shape
    A = st.as_strided(A, shape=new_shape, strides=strides)
    w = np.ones(k) / k
    return np.dot(A.reshape((m, n, -1)), w)
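As a small illustration of that 'arbitrary kernel' point, you could pass the flattened kernel in yourself. A sketch (filt is just a made-up name, and the sharpening kernel is only an example):

# Sketch: the same strided/dot-product trick with an arbitrary 3x3 (or larger) kernel.
def filt(A, kernel):
    kernel = np.asarray(kernel, dtype=float)
    m = A.shape[0] - kernel.shape[0] + 1
    n = A.shape[1] - kernel.shape[1] + 1
    windows = st.as_strided(A, shape=(m, n) + kernel.shape, strides=A.strides * 2)
    return np.dot(windows.reshape((m, n, -1)), kernel.ravel())

A = np.arange(100, dtype=float).reshape((10, 10))
print(filt(A, [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]))   # e.g. a sharpening kernel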