Numpy vs Cython speed


With slight modification, version 3 becomes twice as fast:

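    import numpy as np
    cimport numpy as np
    cimport cython

    # DTYPE_t is the element type defined in the question (not shown here).

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.nonecheck(False)
    def process2(np.ndarray[DTYPE_t, ndim=2] array):
        cdef unsigned int rows = array.shape[0]
        cdef unsigned int cols = array.shape[1]
        cdef unsigned int row, col, row2
        # np.zeros, not np.empty: the loop accumulates with +=, so out must start at 0
        cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
        for row in range(rows):
            for row2 in range(rows):
                # innermost loop runs along axis 1, the contiguous axis
                for col in range(cols):
                    out[row, col] += array[row2, col] - array[row, col]
        return out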

The bottleneck in your calculation is memory access. Your input array is C-ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your innermost loop should run along axis 1 (col), not axis 0 (row2). Making this change cuts the run time roughly in half.
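The memory layout described above can be inspected from plain NumPy. The following is a minimal sketch, assuming a float64 array (8 bytes per element); the variable names are only illustrative.

```python
import numpy as np

# A C-ordered (row-major) array, like the one in the question but smaller.
a = np.random.rand(1000, 10)

print(a.flags["C_CONTIGUOUS"])  # True for a freshly created array

# strides = number of bytes to step to reach the next element along each axis
print(a.strides)  # (80, 8): 80 bytes to the next row, 8 bytes to the next column

# The same data in Fortran (column-major) order has the opposite pattern.
f = np.asfortranarray(a)
print(f.strides)  # (8, 8000): now axis 0 is the contiguous one
```

For the C-ordered array, stepping along axis 1 touches adjacent memory, which is why that axis belongs in the innermost loop.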

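If you need to use this function on small input arrays, allocation overhead starts to matter. np.empty is cheaper than np.ones because it skips initialization, but it returns uninitialized memory, so every element must be assigned before the loop accumulates into it with +=. To reduce the overhead further, use PyArray_EMPTY from the NumPy C API.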
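As a rough illustration of using np.empty safely, here is a pure-Python mirror of the loop nest that assigns each output row before accumulating into it. The function name `column_offsets` is hypothetical, and this sketch is meant for checking results on small inputs, not for speed.

```python
import numpy as np

def column_offsets(array):
    """out[r, c] = sum over r2 of (array[r2, c] - array[r, c])."""
    rows, cols = array.shape
    out = np.empty((rows, cols))  # uninitialized: contents are arbitrary
    for row in range(rows):
        out[row, :] = 0.0         # assign before accumulating with +=
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]
    return out

a = np.arange(12, dtype=float).reshape(4, 3)
# Closed form of the same sum: column totals minus rows * array
expected = a.sum(axis=0) - a.shape[0] * a
print(np.allclose(column_offsets(a), expected))
```

The closed-form expression gives a convenient way to check any compiled version on a small array.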

If you use this function on very large input arrays (dimensions on the order of 2**31 or more), the fixed-width C integers used for indexing (and in the range function) can overflow. To be safe, use:

    cdef Py_ssize_t rows = array.shape[0]
    cdef Py_ssize_t cols = array.shape[1]
    cdef Py_ssize_t row, col, row2

instead of

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
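The wraparound can be demonstrated from Python with ctypes, which does no overflow checking. This is only a sketch: it assumes that a C int and unsigned int are 32 bits wide, which holds on common platforms but is not guaranteed, and the helper names are made up for illustration.

```python
import ctypes

def as_uint32(n):
    """Value of n after being stored in a 32-bit unsigned C integer."""
    return ctypes.c_uint32(n).value

def as_int32(n):
    """Value of n after being stored in a 32-bit signed C integer."""
    return ctypes.c_int32(n).value

print(as_uint32(2**32 - 1))  # 4294967295, the largest value that fits
print(as_uint32(2**32))      # 0: wrapped around
print(as_int32(2**31))       # -2147483648: wrapped to a negative index

# Py_ssize_t is pointer-sized, so it can index anything that fits in memory.
print(ctypes.sizeof(ctypes.c_ssize_t))
```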

Timing:

    In [2]: a = np.random.rand(10000, 10)

    In [3]: timeit process(a)
    1 loops, best of 3: 3.53 s per loop

    In [4]: timeit process2(a)
    1 loops, best of 3: 1.84 s per loop

where process is your version 3.


As mentioned in the other answers, version 2 is essentially the same as version 1, since Cython is unable to dig into the array access operator in order to optimise it. There are two reasons why this costs performance:

  • First, there is a certain amount of overhead in each call to a numpy function, as compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.

  • Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.

Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
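To make the intermediate-array point concrete, here is a small sketch comparing the plain expression with one that reuses a single preallocated buffer through the ufunc `out=` argument. The array size is arbitrary and the variable names are only illustrative.

```python
import numpy as np

A, B, C = np.random.rand(3, 100000)

# Plain expression: B * C is materialized as a temporary array,
# then A + temporary allocates a second array for the result.
out_simple = A + B * C

# Same arithmetic, but every step writes into one preallocated buffer.
out_buffer = np.empty_like(A)
np.multiply(B, C, out=out_buffer)
np.add(A, out_buffer, out=out_buffer)

print(np.allclose(out_simple, out_buffer))
```

This avoids the extra allocations but still makes two passes over memory; a fused loop in Cython makes one pass and keeps each intermediate value in a register.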

As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.


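I would recommend using the -a flag (for example, cython -a yourfile.pyx, where yourfile.pyx stands in for your own module) to have Cython generate an annotated HTML file that shows which lines are translated into pure C and which still call the Python API: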

http://docs.cython.org/src/quickstart/cythonize.html

Version 2 gives nearly the same result as Version 1, because all of the heavy lifting is being done by the Python API (via numpy) and cython isn't doing anything for you. In fact on my machine, numpy is built against MKL, so when I compile the cython generated c code using gcc, Version 3 is actually a little slower than the other two.

Cython shines when you are doing an array manipulation that numpy can't do in a 'vectorized' way, or when you are doing something memory-intensive and it lets you avoid creating a large temporary array. I've gotten 115x speed-ups using cython vs numpy for some of my own code:
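As one hypothetical example of a computation that does not map neatly onto a single vectorized call, consider a running sum that is clamped after every step: each value depends on the previous clamped result. The function `clamped_cumsum` below is invented for illustration and is not taken from the linked repository.

```python
import numpy as np

def clamped_cumsum(values, lo, hi):
    """Running sum that is clamped to [lo, hi] after every step."""
    total = 0.0
    out = np.empty(len(values))
    for i, v in enumerate(values):
        # Each step depends on the previous clamped total.
        total = min(max(total + v, lo), hi)
        out[i] = total
    return out

values = [1.0, 2.0, -5.0, 1.0]
looped = clamped_cumsum(values, 0.0, 2.0)

# Clipping after a plain cumulative sum is not the same computation.
vectorized_guess = np.clip(np.cumsum(values), 0.0, 2.0)

print(looped)            # [1. 2. 0. 1.]
print(vectorized_guess)  # [1. 2. 0. 0.]
```

A loop like this is slow in pure Python but translates directly into a typed C loop in Cython.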

https://github.com/synapticarbors/pylangevin-integrator

Part of that was calling randomkit directly at the level of the C code instead of calling it through numpy.random, but most of it came from Cython translating the computationally intensive for loops into pure C without calls to Python.