A fast way to find the largest N elements in a NumPy array
numpy 1.8 implements partition and argpartition, which perform a partial sort in O(n) time, as opposed to a full sort, which is O(n log n).
```python
import numpy as np

test = np.array([9, 1, 3, 4, 8, 7, 2, 5, 6, 0])
temp = np.argpartition(-test, 4)
result_args = temp[:4]
temp = np.partition(-test, 4)
result = -temp[:4]
```
Result:
```python
>>> result_args
array([0, 4, 8, 5])  # indices of the highest values
>>> result
array([9, 8, 6, 7])  # the highest values
```
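Note that np.argpartition only guarantees that the n selected elements come first, not that they are ordered among themselves (the result above is not in descending order). If you need the top n actually sorted, a follow-up sort of just those n entries keeps the overall cost low; a minimal sketch:

```python
import numpy as np

test = np.array([9, 1, 3, 4, 8, 7, 2, 5, 6, 0])
n = 4

# indices of the n largest values, in arbitrary order
idx = np.argpartition(-test, n)[:n]
# sort only those n entries (descending by value)
idx_sorted = idx[np.argsort(-test[idx])]
top_sorted = test[idx_sorted]
```

This is O(n) for the partition plus O(k log k) for sorting the k selected entries, which is still much cheaper than a full sort when k is small.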
Timing:
```python
In [16]: a = np.arange(10000)
In [17]: np.random.shuffle(a)
In [18]: %timeit np.argsort(a)
1000 loops, best of 3: 1.02 ms per loop
In [19]: %timeit np.argpartition(a, 100)
10000 loops, best of 3: 139 us per loop
In [20]: %timeit np.argpartition(a, 1000)
10000 loops, best of 3: 141 us per loop
```
The bottleneck module has a fast partial-sort method that works directly with NumPy arrays: bottleneck.partition(). Note that bottleneck.partition() returns the actual values sorted; if you want the indexes of the sorted values (what numpy.argsort() returns), you should use bottleneck.argpartition().
I've benchmarked:

```python
z = -bottleneck.partition(-a, 10)[:10]
z = a.argsort()[-10:]
z = heapq.nlargest(10, a)
```

where a is a random 1,000,000-element array.
The timings were as follows:
```
bottleneck.partition(): 25.6 ms per loop
np.argsort():           198 ms per loop
heapq.nlargest():       358 ms per loop
```
I had this problem and, since this question is 5 years old, I had to redo all the benchmarks and update the bottleneck syntax (there is no partsort anymore; it's partition now).
I used the same arguments as kwgoodman, except the number of elements retrieved, which I increased to 50 (to better fit my particular situation).
I got these results:
```
bottleneck 1: 01.12 ms per loop
bottleneck 2: 00.95 ms per loop
pandas      : 01.65 ms per loop
heapq       : 08.61 ms per loop
numpy       : 12.37 ms per loop
numpy 2     : 00.95 ms per loop
```
So, bottleneck_2 and numpy_2 (adas's solution) were tied. However, with np.percentile (numpy_2) you get the top-N elements already sorted, which is not the case for the other solutions. On the other hand, if you are also interested in the indexes of those elements, percentile is not useful.
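To illustrate the percentile approach on a small array (a minimal sketch): the returned values come back sorted ascending, but because np.percentile interpolates between elements by default, the results may only approximate the actual array values when the percentile ranks don't land exactly on elements.

```python
import numpy as np

a = np.array([9, 1, 3, 4, 8, 7, 2, 5, 6, 0])
n = 4
M = a.shape[0]
# percentile ranks corresponding to the n highest-ranked elements
perc = (np.arange(M - n, M) + 1.0) / M * 100
top = np.percentile(a, perc)  # sorted ascending; may be interpolated
```

For large arrays and small n, the interpolation error is negligible, which is why the benchmark treats it as a top-N method.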
I added pandas too, which uses bottleneck underneath if it is available (http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies). If you already have a pandas Series or DataFrame to start with, you are in good hands: just use nlargest and you're done.
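A small sketch of that, wrapping a plain NumPy array in a Series: nlargest returns the values in descending order, and the Series index keeps the original positions, so you get values and their indexes in one call.

```python
import numpy as np
import pandas as pd

a = np.array([9, 1, 3, 4, 8, 7, 2, 5, 6, 0])
s = pd.Series(a)
top = s.nlargest(4)  # values descending; index holds original positions
```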
The code used for the benchmark is as follows (Python 3):
```python
import time
import heapq

import numpy as np
import bottleneck as bn
import pandas as pd


def bottleneck_1(a, n):
    return -bn.partition(-a, n)[:n]

def bottleneck_2(a, n):
    return bn.partition(a, a.size - n)[-n:]

def numpy(a, n):
    return a[a.argsort()[-n:]]

def numpy_2(a, n):
    M = a.shape[0]
    perc = (np.arange(M - n, M) + 1.0) / M * 100
    return np.percentile(a, perc)

def pandas(a, n):
    return pd.Series(a).nlargest(n)

def hpq(a, n):
    return heapq.nlargest(n, a)

def do_nothing(a, n):
    return a[:n]

def benchmark(func, size=1000000, ntimes=100, topn=50):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a, topn)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop

t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(pandas)
t4 = benchmark(hpq)
t5 = benchmark(numpy)
t6 = benchmark(numpy_2)
t0 = benchmark(do_nothing)

print("bottleneck 1: {:05.2f} ms per loop".format(t1 - t0))
print("bottleneck 2: {:05.2f} ms per loop".format(t2 - t0))
print("pandas      : {:05.2f} ms per loop".format(t3 - t0))
print("heapq       : {:05.2f} ms per loop".format(t4 - t0))
print("numpy       : {:05.2f} ms per loop".format(t5 - t0))
print("numpy 2     : {:05.2f} ms per loop".format(t6 - t0))
```