Cython inline function with numpy array as parameter

python performance numpy inline cython

More than 3 years have passed since the question was posted, and there has been great progress in the meantime. On this code (Update 2 of the question):

    # cython: infer_types=True
    # cython: boundscheck=False
    # cython: wraparound=False
    import numpy as np
    cimport numpy as np

    cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
        arr[i, j] += 1

    def test1(np.ndarray[np.int32_t, ndim=2] arr):
        cdef int i, j
        for i in xrange(arr.shape[0]):
            for j in xrange(arr.shape[1]):
                inc(arr, i, j)

    def test2(np.ndarray[np.int32_t, ndim=2] arr):
        cdef int i, j
        for i in xrange(arr.shape[0]):
            for j in xrange(arr.shape[1]):
                arr[i, j] += 1

I get the following timings:

    arr = np.zeros((1000,1000), dtype=np.int32)
    %timeit test1(arr)
    %timeit test2(arr)

    1 loops, best of 3: 354 ms per loop
    1000 loops, best of 3: 1.02 ms per loop
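The `%timeit` magic only exists inside IPython. Outside of it, a rough equivalent of the "best of 3" figure can be sketched with the standard `timeit` module. The helper name `best_of` and the stand-in workload are made up for illustration; once the extension is compiled, something like `lambda: test1(arr)` would take the place of the lambda.

```python
import timeit


def best_of(func, repeat=3, number=10):
    """Return the smallest per-call time in seconds over `repeat` runs."""
    # Each run calls func() `number` times and reports the total time,
    # so dividing by `number` gives the per-call time of that run.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


if __name__ == "__main__":
    # Hypothetical stand-in workload, not one of the test functions above.
    data = list(range(1000))
    print("%.3g s per loop" % best_of(lambda: sum(data)))
```

Taking the minimum rather than the mean follows what `%timeit` reports here: the fastest run is the one least disturbed by other activity on the machine.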

So the problem is still reproducible more than 3 years later. Cython now has typed memoryviews; AFAIK they were introduced in Cython 0.16, so they were not available when the question was posted. With this:

    # cython: infer_types=True
    # cython: boundscheck=False
    # cython: wraparound=False
    import numpy as np
    cimport numpy as np

    cdef inline inc(int[:, ::1] tmv, int i, int j):
        tmv[i, j] += 1

    def test3(np.ndarray[np.int32_t, ndim=2] arr):
        cdef int i, j
        cdef int[:, ::1] tmv = arr
        for i in xrange(tmv.shape[0]):
            for j in xrange(tmv.shape[1]):
                inc(tmv, i, j)

    def test4(np.ndarray[np.int32_t, ndim=2] arr):
        cdef int i, j
        cdef int[:, ::1] tmv = arr
        for i in xrange(tmv.shape[0]):
            for j in xrange(tmv.shape[1]):
                tmv[i, j] += 1

With this I get:

    arr = np.zeros((1000,1000), dtype=np.int32)
    %timeit test3(arr)
    %timeit test4(arr)

    1000 loops, best of 3: 977 µs per loop
    1000 loops, best of 3: 838 µs per loop

We are almost there and already faster than the old-fashioned way! Now, the inc() function is eligible to be declared nogil, so let's declare it so! But oops:

    Error compiling Cython file:
    [...]
    cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
        ^
    [...]
    Function with Python return type cannot be declared nogil

Aaah, I totally missed that the void return type was missing! Once again but now with void:

    cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
        tmv[i, j] += 1

And finally I get:

    %timeit test3(arr)
    %timeit test4(arr)

    1000 loops, best of 3: 843 µs per loop
    1000 loops, best of 3: 853 µs per loop

As fast as manual inlining!

Now, just for fun, I tried Numba on this code:

    import numpy as np
    from numba import autojit, jit

    @autojit
    def inc(arr, i, j):
        arr[i, j] += 1

    @autojit
    def test5(arr):
        for i in xrange(arr.shape[0]):
            for j in xrange(arr.shape[1]):
                inc(arr, i, j)

I get:

    arr = np.zeros((1000,1000), dtype=np.int32)
    %timeit test5(arr)

    100 loops, best of 3: 4.03 ms per loop

Even though it's 4.7x slower than Cython, most likely because the JIT compiler failed to inline inc(), I think it is AWESOME! All I needed to do was add @autojit, and I didn't have to clutter the code with clumsy type declarations; an 88x speedup over test1() for next to nothing!
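For reference, both ratios can be reproduced from the timings reported earlier. The text does not say which Cython timing the 4.7x figure was measured against; the 853 µs figure used below is an assumption that happens to match the rounding.

```python
# Timings copied from the measurements above, all in milliseconds.
cython_ndarray_inline_ms = 354.0   # test1: inline function taking an np.ndarray
cython_memoryview_ms = 0.853       # assumed baseline: 853 µs from the final run
numba_autojit_ms = 4.03            # test5

speedup_over_test1 = cython_ndarray_inline_ms / numba_autojit_ms
slowdown_vs_cython = numba_autojit_ms / cython_memoryview_ms

print("Numba vs. test1:  %.1fx faster" % speedup_over_test1)   # 87.8x, i.e. about 88x
print("Numba vs. Cython: %.1fx slower" % slowdown_vs_cython)   # 4.7x
```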

I have tried other things with Numba, such as

    @jit('void(i4[:,:],i4,i4)')
    def inc(arr, i, j):
        arr[i, j] += 1

or nopython=True but failed to improve it any further.

Improving inlining is on the Numba developers' list; we only need to file more requests to give it higher priority. ;)


The problem is that assigning a numpy array (or, equivalently, passing it in as a function argument) is not just a simple assignment, but a "buffer extraction" which populates a struct and pulls out the stride and pointer information into local variables needed for fast indexing. If you're iterating over a moderate number of elements, this O(1) overhead is easily amortized over the loop, but that is certainly not the case for small functions.
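The effect can be imitated in pure Python. The sketch below is only an analogy, not what Cython generates: acquiring a `memoryview` from an `array.array` stands in for the buffer extraction, and all function names are invented for illustration. The first driver pays the acquisition cost once per element, the second pays it once per loop.

```python
from array import array


def inc_acquire_each_call(buf, i):
    # Analogue of passing the ndarray itself: the buffer is acquired
    # (and released again) on every single call.
    with memoryview(buf) as view:
        view[i] += 1


def inc_view(view, i):
    # Analogue of passing an already-acquired view: no setup per call.
    view[i] += 1


def run_per_call(buf):
    for i in range(len(buf)):
        inc_acquire_each_call(buf, i)


def run_hoisted(buf):
    # Acquire once, outside the loop, so the fixed cost is amortized.
    with memoryview(buf) as view:
        for i in range(len(view)):
            inc_view(view, i)


a = array("i", [0] * 1000)
b = array("i", [0] * 1000)
run_per_call(a)
run_hoisted(b)
assert a == b == array("i", [1] * 1000)  # same result, different overhead
```

Both drivers produce the same array; they differ only in how often the fixed setup cost is paid, which is the same trade-off a tiny function taking an ndarray argument runs into.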

Improving this is high on many people's wishlist, but it's a non-trivial change. See, e.g., the discussion at http://groups.google.com/group/cython-users/browse_thread/thread/8fc8686315d7f3fe


You are passing the array to inc() as a Python object of type numpy.ndarray. Passing Python objects is expensive due to issues like reference counting, and it seems to prevent inlining. If you pass the array the C way, i.e. as a pointer, test1() becomes even faster than test2() on my machine:

    cimport numpy as np

    cdef inline inc(int* arr, int i):
        arr[i] += 1

    def test1(np.ndarray[np.int32_t] arr):
        cdef int i
        for i in xrange(len(arr)):
            inc(<int*>arr.data, i)
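This example uses a one-dimensional array. To apply the same pointer trick to the two-dimensional array from the question, the index pair would have to be flattened by hand: for a C-contiguous array with `ncols` columns, element `(i, j)` sits at offset `i * ncols + j`. The sketch below illustrates only that arithmetic in plain Python, with a flat `array.array` standing in for the `int*`; the function names are hypothetical.

```python
from array import array


def flat_index(i, j, ncols):
    """Offset of element (i, j) in a row-major (C-contiguous) 2-D block."""
    return i * ncols + j


def inc(flat, i, j, ncols):
    # Mirrors `arr[i * ncols + j] += 1` on a raw pointer.
    flat[flat_index(i, j, ncols)] += 1


nrows, ncols = 3, 4
flat = array("i", [0] * (nrows * ncols))
for i in range(nrows):
    for j in range(ncols):
        inc(flat, i, j, ncols)

assert list(flat) == [1] * (nrows * ncols)  # every element touched exactly once
```

This only holds for C-contiguous data. A sliced or transposed array needs stride information, which a bare pointer does not carry.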
