Get unique rows of dask array without using dask dataframe


You can always just use numpy.unique:

    import dask.array as da
    import numpy as np

    dx = da.random.random((10000, 10000), chunks=(1000, 1000))
    # Pull the array into memory as a NumPy array, then deduplicate its rows.
    unique_rows = np.unique(dx.compute(), axis=0)

This will still leave you with memory issues on "data sets larger than my RAM", since the whole calculation runs in memory on a single node. There is a dask.array.unique function, but it doesn't support the axis keyword yet, so it flattens the array and returns the unique scalar values rather than the unique rows. The sorting functions that a hand-rolled parallelized version would need don't seem to be implemented in dask.array either.
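To make the difference between unique values and unique rows concrete, here is a small sketch in plain NumPy. The array `rows` is made-up illustration data; the call without `axis` shows the flattening behaviour described above.

```python
import numpy as np

# Hypothetical example data: the first and third rows are identical.
rows = np.array([[1, 2], [3, 4], [1, 2], [4, 3]])

# Without axis, the array is flattened and unique scalar values come back.
flat_unique = np.unique(rows)

# With axis=0, whole rows are compared and unique rows come back.
row_unique = np.unique(rows, axis=0)

print(flat_unique)
print(row_unique)
```

Note that `[3, 4]` and `[4, 3]` both survive with `axis=0`, because they are different rows even though they contain the same values.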

My recommendation would be to just suck it up for now and convert to dask.dataframe. That approach ensures you get the correct output, even if it's not the fastest conceivable implementation.

Edit

I initially thought there might be a simple hack that could be used to implement the axis parameter for dask.array.unique. However, the blob type trick that numpy.unique uses to implement its own axis keyword turns out not to carry over easily to Dask arrays, owing to the presence of chunks.
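For reference, the general idea of that kind of trick looks roughly like this in plain NumPy. This is a sketch using a structured-dtype view, written for illustration and not copied from NumPy's source. It only works because each row sits in one contiguous block of memory, which is presumably the property that a chunked array does not guarantee.

```python
import numpy as np

# Hypothetical example data; the rows must be C-contiguous for the view below.
rows = np.ascontiguousarray(np.array([[1, 2], [3, 4], [1, 2], [4, 3]]))
n_cols = rows.shape[1]

# Reinterpret each row as a single structured item with one field per column.
row_dtype = np.dtype([(f"f{i}", rows.dtype) for i in range(n_cols)])
packed = rows.view(row_dtype).ravel()  # shape (n_rows,)

# The 1-D unique now operates on whole rows at once.
unique_packed = np.unique(packed)

# Unpack the structured items back into an ordinary 2-D array.
unique_rows = unique_packed.view(rows.dtype).reshape(-1, n_cols)
print(unique_rows)
```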

So no clever workaround for now. Just use dask.dataframe.