
Nested numpy arrays in dask and pandas dataframes


The data organisation that you have does indeed sound an awful lot like an xarray: multi-dimensional data, with regular coordinates along each of the dimensions, plus variable properties. xarray lets you operate on your array in a pandas-like fashion (the docs are very detailed, so I won't go into it). Of note, xarray interfaces directly with Dask, so that as you operate on the high-level data structure you are actually manipulating dask arrays underneath, and can therefore compute out-of-core and/or in a distributed fashion.
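As a rough sketch of what that looks like (the dimension names, coordinate values and the variable name "temperature" below are made up purely for illustration), you can wrap a chunked dask array in an xarray DataArray and work with it lazily:

import numpy as np
import pandas as pd
import xarray as xr
import dask.array as da

# lazily-evaluated dask array, chunked along the time dimension
data = da.random.random((365, 100, 100), chunks=(30, 100, 100))

arr = xr.DataArray(
    data,
    dims=("time", "x", "y"),
    coords={"time": pd.date_range("2020-01-01", periods=365),
            "x": np.arange(100),
            "y": np.arange(100)},
    name="temperature",
)

# pandas-like, label-based operations stay lazy until .compute()
monthly_mean = arr.groupby("time.month").mean()
print(monthly_mean.compute())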

Although xarray was inspired by the netCDF hierarchical data representation (typically stored as HDF5 files), there are a number of storage options you could use, including zarr, which is particularly useful as a cloud-friendly format for the kind of parallel access Dask would like to use.
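A minimal sketch of round-tripping such a dataset through zarr (the store path "example.zarr" and the variable name are placeholders; this assumes the zarr package is installed):

import numpy as np
import xarray as xr
import dask.array as da

ds = xr.Dataset(
    {"temperature": (("time", "x"), da.random.random((365, 1000), chunks=(30, 1000)))},
    coords={"time": np.arange(365), "x": np.arange(1000)},
)

# each dask chunk becomes a separate compressed chunk in the zarr store
ds.to_zarr("example.zarr", mode="w")

# re-open lazily; chunks are only read when something is actually computed
reopened = xr.open_zarr("example.zarr")
print(reopened["temperature"].mean().compute())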


One (perhaps ugly) way is to monkey-patch the pandas and dask parquet APIs to support multi-dimensional arrays:

# these monkey-patches into the pandas and dask I/O APIs allow us to save
# multi-dimensional numpy arrays in parquet format by serializing them into byte arrays
from io import BytesIO

import numpy as np
import pandas as pd
from dask import dataframe as dd


# pandas read: turn serialized .npy byte strings back into ndarrays
def _patched_pd_read_parquet(*args, **kwargs):
    return _orig_pd_read_parquet(*args, **kwargs).applymap(
        lambda val: np.load(BytesIO(val)) if isinstance(val, bytes) else val)

_orig_pd_read_parquet = pd.io.parquet.PyArrowImpl.read
pd.io.parquet.PyArrowImpl.read = _patched_pd_read_parquet


def _serialize_ndarray(arr: np.ndarray) -> bytes:
    # encode an ndarray cell as the bytes of its .npy representation
    if isinstance(arr, np.ndarray):
        with BytesIO() as buf:
            np.save(buf, arr)
            return buf.getvalue()
    return arr


def _deserialize_ndarray(val: bytes) -> np.ndarray:
    # decode .npy bytes back into an ndarray; leave other values untouched
    return np.load(BytesIO(val)) if isinstance(val, bytes) else val


# pandas write: serialize ndarray cells before handing the frame to pyarrow
def _patched_pd_write_parquet(self, df: pd.DataFrame, *args, **kwargs):
    return _orig_pd_write_parquet(self, df.applymap(_serialize_ndarray), *args, **kwargs)

_orig_pd_write_parquet = pd.io.parquet.PyArrowImpl.write
pd.io.parquet.PyArrowImpl.write = _patched_pd_write_parquet


# dask read: deserialize each parquet piece as it is loaded
def _patched_dask_read_pyarrow_parquet_piece(*args, **kwargs):
    return _orig_dask_read_pyarrow_parquet_piece(*args, **kwargs).applymap(_deserialize_ndarray)

_orig_dask_read_pyarrow_parquet_piece = dd.io.parquet._read_pyarrow_parquet_piece
dd.io.parquet._read_pyarrow_parquet_piece = _patched_dask_read_pyarrow_parquet_piece


# dask write: serialize ndarray cells in each partition before it is written
def _patched_dd_write_partition_pyarrow(df: pd.DataFrame, *args, **kwargs):
    return _orig_dd_write_partition_pyarrow(df.applymap(_serialize_ndarray), *args, **kwargs)

_orig_dd_write_partition_pyarrow = dd.io.parquet._write_partition_pyarrow
dd.io.parquet._write_partition_pyarrow = _patched_dd_write_partition_pyarrow

You can then use the tricks specified in the question to get nested arrays in pandas cells (in-memory), and the above acts as a "poor-man's" codec, serializing the arrays into byte streams that storage formats such as parquet can handle.
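For illustration, here is roughly how the patched I/O might be used. This is only a sketch: it assumes a pandas/dask version whose internals still expose the patched functions above, and the path "nested.parquet" and column names are placeholders.

import numpy as np
import pandas as pd
from dask import dataframe as dd

# a pandas frame whose "features" column holds a full numpy array per cell
pdf = pd.DataFrame({
    "id": [1, 2, 3],
    "features": [np.random.rand(4, 4) for _ in range(3)],
})

ddf = dd.from_pandas(pdf, npartitions=2)

# with the patches applied, the arrays are transparently serialized on write ...
ddf.to_parquet("nested.parquet", engine="pyarrow")

# ... and deserialized back into numpy arrays on read
restored = dd.read_parquet("nested.parquet", engine="pyarrow").compute()
print(type(restored["features"].iloc[0]))  # <class 'numpy.ndarray'>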