How to turn a NumPy array into a set efficiently?
The current state of your question (it can change any time): how can I efficiently extract the unique elements from each row of a large 2-D array?
    import numpy as np

    rng = np.random.default_rng()
    arr = rng.random((3000, 30000))

    out1 = list(map(np.unique, arr))
    # or
    out2 = [np.unique(subarr) for subarr in arr]
Runtimes in an IPython shell:
    >>> %timeit list(map(np.unique, arr))
    5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit [np.unique(subarr) for subarr in arr]
    5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly all be unique. So here's a more realistic example with integers:
    >>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))
    >>> %timeit list(map(np.unique, arr))
    4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit [np.unique(subarr) for subarr in arr]
    4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
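Since the title asks about turning arrays into sets: Python's built-in `set` is another way to express the same per-row deduplication, although it gives unordered sets of Python ints rather than sorted arrays. A minimal sketch of the equivalence (small sizes chosen for illustration, not benchmarking):

    import numpy as np

    rng = np.random.default_rng(0)
    arr = rng.integers(low=1, high=50, size=(4, 100))

    # Per-row Python sets; .tolist() converts elements to plain ints
    sets = [set(row.tolist()) for row in arr]

    # np.unique returns the same values, as a sorted array per row
    uniques = [np.unique(row) for row in arr]
    assert all(set(u.tolist()) == s for u, s in zip(uniques, sets))

Whether `set` or `np.unique` is faster per row depends on dtype and duplicate density, so it is worth timing both on your actual data.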
A couple of earlier 'row-wise' unique questions:
vectorize numpy unique for subarrays
Numpy: Row Wise Unique elements
Count unique elements row wise in an ndarray
In a couple of these the count is more interesting than the actual unique values.
If the number of unique values per row differs, the result cannot be a 2-D array. That's a strong indication that the problem cannot be fully vectorized: you need some kind of iteration over the rows.
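That said, you can push most of the work into a single vectorized call and keep only a cheap slicing loop per row: sort the whole array once along `axis=1`, then drop consecutive duplicates in each sorted row. This is a sketch, not a guaranteed speedup; whether it beats per-row `np.unique` on your data is an assumption you should benchmark:

    import numpy as np

    rng = np.random.default_rng(0)
    arr = rng.integers(low=1, high=15000, size=(30, 300))  # small demo sizes

    # One vectorized sort over all rows
    s = np.sort(arr, axis=1)

    # Keep the first element of each row, plus any element that
    # differs from its left neighbour (i.e. drop consecutive duplicates)
    keep = np.concatenate(
        [np.ones((s.shape[0], 1), dtype=bool), s[:, 1:] != s[:, :-1]],
        axis=1,
    )

    # Ragged result: one 1-D array of sorted uniques per row
    out = [row[k] for row, k in zip(s, keep)]

    # Matches np.unique row by row
    assert all(np.array_equal(u, np.unique(r)) for u, r in zip(out, arr))

The per-row loop here does only boolean indexing; the sorting and neighbour comparison, which dominate the cost, run as whole-array NumPy operations.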