NumPy: sort an ndarray on multiple columns
Sort a NumPy ndarray by its 1st, 2nd or 3rd column:
>>> import numpy as np
>>> a = np.array([[1,30,200], [2,20,300], [3,10,100]])
>>> a
array([[  1,  30, 200],
       [  2,  20, 300],
       [  3,  10, 100]])
>>> a[a[:,2].argsort()] #sort by the 3rd column ascending
array([[  3,  10, 100],
       [  1,  30, 200],
       [  2,  20, 300]])
>>> a[a[:,2].argsort()][::-1] #sort by the 3rd column descending
array([[  2,  20, 300],
       [  1,  30, 200],
       [  3,  10, 100]])
>>> a[a[:,1].argsort()] #sort by the 2nd column ascending
array([[  3,  10, 100],
       [  2,  20, 300],
       [  1,  30, 200]])
To explain what is going on here: argsort() returns an array of the integer indices that would put its input in sorted order: https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
>>> x = np.array([15, 30, 4, 80, 6])
>>> np.argsort(x)
array([2, 4, 0, 1, 3])
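As a small check of that description, the sketch below indexes an array with its own argsort result and then applies the same idea to the rows of a 2-D array. The variable names order, sorted_x and rows_by_col2 are illustrative and not part of the original answer.

```python
import numpy as np

x = np.array([15, 30, 4, 80, 6])
order = np.argsort(x)   # indices that would put x in ascending order
sorted_x = x[order]     # indexing with those indices yields the sorted values

# The same trick on a 2-D array: argsort one column, then reorder whole rows.
a = np.array([[1, 30, 200], [2, 20, 300], [3, 10, 100]])
rows_by_col2 = a[a[:, 1].argsort()]  # rows ordered by the 2nd column
```

Here sorted_x holds the same values as np.sort(x), while order also records where each value came from, which is what makes row reordering possible.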
Sort by column 1, then by column 2, then by column 3 (np.lexsort treats the last key it is given as the primary sort key):
>>> a = np.array([[2,30,200], [1,30,200], [1,10,200]])
>>> a
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])
>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))]
array([[  1,  10, 200],
       [  1,  30, 200],
       [  2,  30, 200]])
>>> a[np.lexsort((a[:,2], a[:,1],a[:,0]))][::-1] #reverse
array([[  2,  30, 200],
       [  1,  30, 200],
       [  1,  10, 200]])
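The array above gives the same result whichever column is the primary key, so it does not show which key np.lexsort prioritises. A minimal sketch with hypothetical data (the array b and the result names are assumptions made for illustration) in which the key order matters:

```python
import numpy as np

# Hypothetical data in which the columns disagree about the row order.
b = np.array([[2, 10, 300],
              [1, 30, 100],
              [1, 20, 200]])

# The last key is the primary one: sort by column 1, break ties with
# column 2, then column 3.
by_col1 = b[np.lexsort((b[:, 2], b[:, 1], b[:, 0]))]

# Reversing the tuple makes column 3 the primary key instead.
by_col3 = b[np.lexsort((b[:, 0], b[:, 1], b[:, 2]))]
```

by_col1 starts with the two rows whose first column is 1, ordered by their second column, whereas by_col3 is ordered by the third column alone because its values are all distinct.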
Import the data, letting NumPy guess the column types, and sort in place:
import numpy as np

# let numpy guess the type with dtype=None
my_data = np.genfromtxt(infile, dtype=None, names=["a", "b", "c", "d"])

# access columns by name
print(my_data["b"]) # column 1

# sort by column 1, then by column 0
my_data.sort(order=["b", "a"])

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", my_data, fmt="%d\t%d\t%.6f\t%.6f")
Alternatively, specifying the input format and sorting to a new array:
import numpy as np

# tell numpy the first 2 columns are int and the last 2 are floats
my_data = np.genfromtxt(infile, dtype=[('a', '<i8'), ('b', '<i8'), ('x', '<f8'), ('d', '<f8')])

# access columns by name
print(my_data["b"]) # column 1

# get the indices to sort the array using lexsort
# the last element of the tuple (column 1) is used as the primary key
ind = np.lexsort((my_data["a"], my_data["b"]))

# create a new, sorted array
sorted_data = my_data[ind]

# save specifying required format (tab separated values)
np.savetxt("sorted.tsv", sorted_data, fmt="%d\t%d\t%.6f\t%.6f")
Output:
2 1 2.000000 0.000000
3 1 2.000000 0.000000
4 1 2.000000 0.000000
2 2 100.000000 0.000000
3 2 4.000000 0.000000
4 2 4.000000 0.000000
2 3 100.000000 0.000000
3 3 6.000000 0.000000
4 3 6.000000 0.000000
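The scripts above read from infile, which is not defined in the text. The following self-contained sketch of the in-place variant assumes a small whitespace-separated table held in an io.StringIO object. The sample rows and the names text and buf are illustrative assumptions, not part of the original.

```python
import io
import numpy as np

# Hypothetical input: four whitespace-separated columns (int, int, float, float).
text = "3 2 4.0 0.0\n2 1 2.0 0.0\n2 2 100.0 0.0\n3 1 2.0 0.0\n"

# dtype=None lets genfromtxt infer a type per column; names labels the fields.
my_data = np.genfromtxt(io.StringIO(text), dtype=None, names=["a", "b", "c", "d"])

# Sort the structured array in place: by field "b" first, then by field "a".
my_data.sort(order=["b", "a"])

# Write tab-separated values to an in-memory buffer instead of a file.
buf = io.StringIO()
np.savetxt(buf, my_data, fmt="%d\t%d\t%.6f\t%.6f")
```

With order=["b", "a"] the first field listed is the primary key, which is the opposite convention to np.lexsort, where the last key is the primary one.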
With np.lexsort you can sort on several columns at once. The keys are passed in reverse order of priority: the last key in the tuple is the primary sort key. That means np.lexsort((col_b,col_a)) sorts by col_a first and uses col_b to break ties:
my_data = np.array([[   2.,    1.,    2.,    0.],
                    [   2.,    2.,  100.,    0.],
                    [   2.,    3.,  100.,    0.],
                    [   3.,    1.,    2.,    0.],
                    [   3.,    2.,    4.,    0.],
                    [   3.,    3.,    6.,    0.],
                    [   4.,    1.,    2.,    0.],
                    [   4.,    2.,    4.,    0.],
                    [   4.,    3.,    6.,    0.]])
ind = np.lexsort((my_data[:,0],my_data[:,1]))
my_data[ind]
Result:
array([[   2.,    1.,    2.,    0.],
       [   3.,    1.,    2.,    0.],
       [   4.,    1.,    2.,    0.],
       [   2.,    2.,  100.,    0.],
       [   3.,    2.,    4.,    0.],
       [   4.,    2.,    4.,    0.],
       [   2.,    3.,  100.,    0.],
       [   3.,    3.,    6.,    0.],
       [   4.,    3.,    6.,    0.]])
If you know that your first column is already sorted, you can use:
ind = my_data[:,1].argsort(kind='stable')
my_data[ind]
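kind='stable' guarantees that rows with equal values in the second column keep their original relative order, so the existing ordering of the first column survives within each group. The default quicksort does not guarantee that, though it is generally faster.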
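To illustrate the point about stability, here is a minimal sketch with hypothetical data whose first column is already ascending. It compares a stable argsort on the second column with the equivalent np.lexsort call. The names data, stable_ind and lex_ind are assumptions made for this example.

```python
import numpy as np

# Hypothetical data: column 0 is already in ascending order.
data = np.array([[2., 3.],
                 [2., 1.],
                 [3., 3.],
                 [3., 1.],
                 [4., 2.]])

# A stable sort on column 1 keeps the existing column-0 order among ties.
stable_ind = data[:, 1].argsort(kind='stable')

# lexsort with column 1 as the primary key and column 0 as the tie-breaker.
lex_ind = np.lexsort((data[:, 0], data[:, 1]))

sorted_rows = data[stable_ind]
```

For this input both index arrays are identical. The shortcut relies on the first column being pre-sorted; otherwise np.lexsort is the safer choice.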