Parallel in-place sort for numpy arrays
I ended up wrapping GCC parallel sort. Here is the code:
parallelSort.pyx
# cython: wraparound = False# cython: boundscheck = Falseimport numpy as npcimport numpy as npimport cythoncimport cython ctypedef fused real: cython.char cython.uchar cython.short cython.ushort cython.int cython.uint cython.long cython.ulong cython.longlong cython.ulonglong cython.float cython.doublecdef extern from "<parallel/algorithm>" namespace "__gnu_parallel": cdef void sort[T](T first, T last) nogil def numpyParallelSort(real[:] a): "In-place parallel sort for numpy types" sort(&a[0], &a[a.shape[0]])
Extra compiler args: -fopenmp (compile) and -lgomp (linking)
This makefile will do it:
all: cython --cplus parallelSort.pyx g++ -g -march=native -Ofast -fpic -c parallelSort.cpp -o parallelSort.o -fopenmp `python-config --includes` g++ -g -march=native -Ofast -shared -o parallelSort.so parallelSort.o `python-config --libs` -lgomp clean: rm -f parallelSort.cpp *.o *.so
And this shows that it works:
from parallelSort import numpyParallelSortimport numpy as np a = np.random.random(100000000)numpyParallelSort(a) print a[:10]
edit: fixed bug noticed in the comment below
Mergesort parallelizes quite naturally. Just have each worker pre-sort an arbitrary chunk, and then run a single merge pass on it. The final merging should require only O(N) operations, and its trivial to write a function for doing so in numba or somesuch.