
Nested numpy arrays in dask and pandas dataframes


The data organisation that you have does indeed sound an awful lot like an xarray: multi-dimensional data, with regular coordinates along each of the dimensions, plus variable properties. xarray lets you operate on your array in a pandas-like fashion (the docs are very detailed, so I won't go into it). Of note, xarray interfaces directly with Dask, so that as you operate on the high-level data structure you are actually manipulating dask arrays underneath, and can therefore compute out-of-core and/or in a distributed fashion.
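As a rough sketch of what that looks like (the dimension names, coordinate values and the variable name "temperature" below are made up purely for illustration), you can wrap a chunked dask array in an xarray DataArray and work with it lazily:

import numpy as np
import pandas as pd
import xarray as xr
import dask.array as da

# lazily-evaluated dask array, chunked along the time dimension
data = da.random.random((365, 100, 100), chunks=(30, 100, 100))

arr = xr.DataArray(
    data,
    dims=("time", "x", "y"),
    coords={"time": pd.date_range("2020-01-01", periods=365),
            "x": np.arange(100),
            "y": np.arange(100)},
    name="temperature",
)

# pandas-like, label-based operations stay lazy until .compute()
monthly_mean = arr.groupby("time.month").mean()
print(monthly_mean.compute())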

Although xarray was inspired by the netCDF hierarchical data representation (typically stored as HDF5 files), there are a number of storage options you could use, including zarr, which is particularly useful as a cloud-friendly format for the kind of parallel access Dask would like to use.
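A minimal sketch of round-tripping such a dataset through zarr (the store path "example.zarr" and the variable name are placeholders; this assumes the zarr package is installed):

import numpy as np
import xarray as xr
import dask.array as da

ds = xr.Dataset(
    {"temperature": (("time", "x"), da.random.random((365, 1000), chunks=(30, 1000)))},
    coords={"time": np.arange(365), "x": np.arange(1000)},
)

# each dask chunk becomes a separate compressed chunk in the zarr store
ds.to_zarr("example.zarr", mode="w")

# re-open lazily; chunks are only read when something is actually computed
reopened = xr.open_zarr("example.zarr")
print(reopened["temperature"].mean().compute())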


One (perhaps ugly) way is to monkey-patch the pandas and dask parquet APIs to support multi-dimensional arrays:

# these monkey-patches into the pandas and dask I/O APIs allow us to save
# multi-dimensional numpy arrays in parquet format by serializing them into byte arrays
from io import BytesIO

import numpy as np
import pandas as pd
from dask import dataframe as dd


# pandas read: turn serialized .npy byte strings back into ndarrays
def _patched_pd_read_parquet(*args, **kwargs):
    return _orig_pd_read_parquet(*args, **kwargs).applymap(
        lambda val: np.load(BytesIO(val)) if isinstance(val, bytes) else val)

_orig_pd_read_parquet = pd.io.parquet.PyArrowImpl.read
pd.io.parquet.PyArrowImpl.read = _patched_pd_read_parquet


def _serialize_ndarray(arr: np.ndarray) -> bytes:
    # encode an ndarray cell as the bytes of its .npy representation
    if isinstance(arr, np.ndarray):
        with BytesIO() as buf:
            np.save(buf, arr)
            return buf.getvalue()
    return arr


def _deserialize_ndarray(val: bytes) -> np.ndarray:
    # decode .npy bytes back into an ndarray; leave other values untouched
    return np.load(BytesIO(val)) if isinstance(val, bytes) else val


# pandas write: serialize ndarray cells before handing the frame to pyarrow
def _patched_pd_write_parquet(self, df: pd.DataFrame, *args, **kwargs):
    return _orig_pd_write_parquet(self, df.applymap(_serialize_ndarray), *args, **kwargs)

_orig_pd_write_parquet = pd.io.parquet.PyArrowImpl.write
pd.io.parquet.PyArrowImpl.write = _patched_pd_write_parquet


# dask read: deserialize each parquet piece as it is loaded
def _patched_dask_read_pyarrow_parquet_piece(*args, **kwargs):
    return _orig_dask_read_pyarrow_parquet_piece(*args, **kwargs).applymap(_deserialize_ndarray)

_orig_dask_read_pyarrow_parquet_piece = dd.io.parquet._read_pyarrow_parquet_piece
dd.io.parquet._read_pyarrow_parquet_piece = _patched_dask_read_pyarrow_parquet_piece


# dask write: serialize ndarray cells in each partition before it is written
def _patched_dd_write_partition_pyarrow(df: pd.DataFrame, *args, **kwargs):
    return _orig_dd_write_partition_pyarrow(df.applymap(_serialize_ndarray), *args, **kwargs)

_orig_dd_write_partition_pyarrow = dd.io.parquet._write_partition_pyarrow
dd.io.parquet._write_partition_pyarrow = _patched_dd_write_partition_pyarrow

You can then use the tricks specified in the question to get nested arrays in pandas cells (in-memory), and the above acts as a "poor-man's" codec, serializing the arrays into byte streams that storage formats such as parquet can handle.
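For illustration, here is roughly how the patched I/O might be used. This is only a sketch: it assumes a pandas/dask version whose internals still expose the patched functions above, and the path "nested.parquet" and column names are placeholders.

import numpy as np
import pandas as pd
from dask import dataframe as dd

# a pandas frame whose "features" column holds a full numpy array per cell
pdf = pd.DataFrame({
    "id": [1, 2, 3],
    "features": [np.random.rand(4, 4) for _ in range(3)],
})

ddf = dd.from_pandas(pdf, npartitions=2)

# with the patches applied, the arrays are transparently serialized on write ...
ddf.to_parquet("nested.parquet", engine="pyarrow")

# ... and deserialized back into numpy arrays on read
restored = dd.read_parquet("nested.parquet", engine="pyarrow").compute()
print(type(restored["features"].iloc[0]))  # <class 'numpy.ndarray'>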