Comparison of np.random.choice vs np.random.shuffle for samples without replacement
To answer your questions:
shuffle
seems to be the fastest implementation- It should give the same answers (in fact, it seems to be the same thing under the hood)
Let's start @SvenMarnach's answer here. This is not a dupe of that question, but the answer is useful. Unfortunately that answer doesn't bring it in line with shuffler
timewise:
%timeit shuffler(50, 2)2.47 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)%timeit chooser(50, 2)52.5 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) rng = np.random.default_rng()def chooser2(size, num_samples): return rng.choice(size, num_samples, replace=False)%timeit chooser2(50, 2)15.9 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
the random.sample
answer is better:
import random def sampler(size, num_samples): return np.array(random.sample(range(size), num_samples))%timeit sampler(50, 2)4.6 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Still slower, though.
Since I'm not able to parse c
code, I'll take sven's word for it that random.choice
is doing shuffle-and-split in the background, so the methods should be equivalent. Why it's so much faster here is baffling to me though.
EDIT: A sample_indices
based on @DaniMesejo's answer (slightly slower for num_samples = 2
):
def sample_indices(pop, size, num_samples): arr = np.random.rand(pop, size) return np.argpartition(arr, num_samples, axis = 1)[:, :num_samples]
See the source code for numpy.random.choice
; with replace=False
it creates a 50-item temporary list, shuffles that list, and takes two items from that list.
Since version 1.17, the implementation decisions for numpy.random.choice
and numpy.random.shuffle
, as with other numpy.random
functions, can't be changed without affecting backward compatibility (see the recent RNG policy for NumPy). See also these questions:
- Why is random.sample faster than numpy's random.choice?
- Why does numpy.random.choice not use arithmetic coding?
- Does numpy.random.seed() always give the same random number every time?
Compare numpy.random.choice
with numpy.random.Generator.choice
, which is the newer way to sample items in NumPy 1.17 and later. The advantage is that numpy.random.Generator.choice
is not subject to the same compatibility guarantee as numpy.random.choice
or numpy.random.shuffle
. If you care about performance of numpy.random.Generator
you can file an issue in NumPy's GitHub repository.
You could use an alternative solution, the idea is to generate a random array and then find the position of the min and max:
import numpy as npdef sample_indices(ran, size): arr = np.random.rand(ran, size) mi = np.argmin(arr, axis=1).reshape((-1, 1)) ma = np.argmax(arr, axis=1).reshape((-1, 1)) return np.hstack((mi, ma))def shuffler(size, num_samples): items = list(range(size)) np.random.shuffle(items) return items[:num_samples]def chooser(size, num_samples): return np.random.choice(size, num_samples, replace=False)def sample_indices_shuffler(ran, size): return np.array([shuffler(size, 2) for _ in range(ran)])def sample_indices_chooser(ran, size): return np.array([chooser(size, 2) for _ in range(ran)])
Here are the timings:
%timeit sample_indices_chooser(1000, 50)17.3 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)%timeit sample_indices_shuffler(1000, 50)2.69 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)%timeit sample_indices(1000, 50)553 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For:
res = sample_indices(10, 50)print(res)
Output
[[ 9 6] [31 42] [17 42] [24 45] [ 2 49] [27 31] [21 19] [ 7 16] [20 28] [32 36]]