Get unique rows of dask array without using dask dataframe

You can always just use numpy.unique:

    import dask.array as da
    import numpy as np

    dx = da.random.random((10000, 10000), chunks=(1000, 1000))
    # compute() loads the whole array into memory as a NumPy array first
    dx = np.unique(dx.compute(), axis=0)

This may still leave you with memory issues for "data sets larger than my RAM", since the whole calculation runs in memory on a single node. There is a dask.array.unique function, but it doesn't support the axis keyword yet, so it flattens the array and returns the unique scalar values rather than the unique rows. The sorting functions that a hand-rolled parallel version would need don't seem to be implemented in dask.array either.

My recommendation would be to just suck it up for now and convert to dask.dataframe. That ensures you get the correct output, even if it's not the fastest conceivable implementation.

Edit

I initially thought there might be a simple hack that could be used to implement the axis parameter for dask.array.unique. However, the blob type trick that numpy.unique uses to implement its own axis keyword turns out to not carry over easily to Dask arrays, owing to the presence of chunks.
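For reference, here is roughly what that trick looks like in plain NumPy. This is a sketch of the idea rather than NumPy's actual source, and the function name unique_rows_packed is made up: each row is viewed as one compound element, so an ordinary 1-D unique can be applied. The view needs every row to sit in a single contiguous buffer, which as far as I can tell is exactly what chunking gets in the way of.

```python
import numpy as np


def unique_rows_packed(a):
    """Unique rows of a 2-D NumPy array via a structured-dtype view.

    Hypothetical helper illustrating the idea only.
    """
    a = np.ascontiguousarray(a)
    n_rows, n_cols = a.shape
    # One field per column, so a single element covers an entire row.
    row_dtype = [(f"f{i}", a.dtype) for i in range(n_cols)]
    packed = a.view(row_dtype).reshape(n_rows)
    # 1-D unique on the packed rows; fields are compared in order.
    uniq = np.unique(packed)
    # Unpack back into an ordinary 2-D array.
    return uniq.view(a.dtype).reshape(-1, n_cols)


data = np.array([[1, 2], [3, 4], [1, 2], [5, 6], [3, 4]])
result = unique_rows_packed(data)
```

On this integer input the result should match np.unique(data, axis=0).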

So no clever workaround for now. Just use dask.dataframe.
