Nested numpy arrays in dask and pandas dataframes
The data organisation you describe does indeed sound an awful lot like an xarray: multi-dimensional data, with regular coordinates along each of the dimensions, plus per-item variable properties. xarray lets you operate on such an array in a pandas-like fashion (the docs are very detailed, so I won't go into it here). Of note, xarray interfaces directly with Dask, so that as you operate on the high-level data structure you are actually manipulating dask arrays underneath, and can therefore compute out-of-core and/or distributed.
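For concreteness, here is a minimal sketch (the variable names, dimensions and sizes are invented for the example) of building a dask-backed xarray Dataset and operating on it with label-based, pandas-like operations:

```python
import numpy as np
import xarray as xr

# hypothetical example: a (time, x, y) cube with two variables,
# chunked so that xarray delegates the work to dask under the hood
ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "x", "y"), np.random.rand(365, 100, 100)),
        "pressure": (("time", "x", "y"), np.random.rand(365, 100, 100)),
    },
    coords={
        "time": np.arange(365),
        "x": np.arange(100),
        "y": np.arange(100),
    },
).chunk({"time": 30})  # each variable is now a dask array

# label-based, pandas-like operations; nothing is evaluated yet
mean_temp = ds["temperature"].sel(x=slice(10, 50)).mean(dim="time")

# computation only happens here, chunk by chunk (out-of-core or distributed)
result = mean_temp.compute()
```

Nothing is loaded or computed until `.compute()` is called, which is what makes the out-of-core/distributed execution possible.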
Although xarray was inspired by the netCDF hierarchical data representation (typically stored as HDF5 files), there are a number of storage options you could use, including zarr, which is particularly useful as a cloud-friendly format for the kind of parallel access Dask wants to perform.
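As a sketch of that storage path (the store name is a placeholder, and this assumes the zarr package is installed alongside xarray):

```python
import numpy as np
import xarray as xr

# a small hypothetical dataset, chunked so it is dask-backed
ds = xr.Dataset(
    {"temperature": (("time", "x"), np.random.rand(365, 100))},
    coords={"time": np.arange(365), "x": np.arange(100)},
).chunk({"time": 30})

# write to a zarr store (a directory of chunk files); each chunk is a
# separately addressable object, which is what makes zarr convenient
# for parallel and cloud access
ds.to_zarr("example_dataset.zarr", mode="w")

# lazily re-open; dask only reads the chunks a computation actually touches
ds_back = xr.open_zarr("example_dataset.zarr")
```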
One (perhaps ugly) way is to monkey-patch the pandas and dask parquet I/O APIs to support multi-dimensional arrays:
```python
# These monkey-patches into the pandas and dask I/O APIs allow us to save
# multi-dimensional numpy arrays in parquet format by serializing them
# into byte arrays.
import numpy as np  # needed by the (de)serialization helpers below
import pandas as pd
from dask import dataframe as dd
from io import BytesIO


def _serialize_ndarray(arr: np.ndarray) -> bytes:
    # encode an ndarray cell as .npy bytes; leave non-array cells untouched
    if isinstance(arr, np.ndarray):
        with BytesIO() as buf:
            np.save(buf, arr)
            return buf.getvalue()
    return arr


def _deserialize_ndarray(val: bytes) -> np.ndarray:
    # decode .npy bytes back into an ndarray; leave other values untouched
    return np.load(BytesIO(val)) if isinstance(val, bytes) else val


# pandas: deserialize on read, serialize on write
def _patched_pd_read_parquet(*args, **kwargs):
    return _orig_pd_read_parquet(*args, **kwargs).applymap(_deserialize_ndarray)

_orig_pd_read_parquet = pd.io.parquet.PyArrowImpl.read
pd.io.parquet.PyArrowImpl.read = _patched_pd_read_parquet


def _patched_pd_write_parquet(self, df: pd.DataFrame, *args, **kwargs):
    return _orig_pd_write_parquet(self, df.applymap(_serialize_ndarray), *args, **kwargs)

_orig_pd_write_parquet = pd.io.parquet.PyArrowImpl.write
pd.io.parquet.PyArrowImpl.write = _patched_pd_write_parquet


# dask: deserialize on read, serialize on write
def _patched_dask_read_pyarrow_parquet_piece(*args, **kwargs):
    return _orig_dask_read_pyarrow_parquet_piece(*args, **kwargs).applymap(_deserialize_ndarray)

_orig_dask_read_pyarrow_parquet_piece = dd.io.parquet._read_pyarrow_parquet_piece
dd.io.parquet._read_pyarrow_parquet_piece = _patched_dask_read_pyarrow_parquet_piece


def _patched_dd_write_partition_pyarrow(df: pd.DataFrame, *args, **kwargs):
    return _orig_dd_write_partition_pyarrow(df.applymap(_serialize_ndarray), *args, **kwargs)

_orig_dd_write_partition_pyarrow = dd.io.parquet._write_partition_pyarrow
dd.io.parquet._write_partition_pyarrow = _patched_dd_write_partition_pyarrow
```
You can then use the tricks specified in the question to get nested arrays in pandas cells (in-memory), and the above will act as a "poor man's" codec, serializing the arrays into byte streams that serialization formats such as parquet can handle.
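As a rough usage sketch, assuming the patches above have been applied, pyarrow is installed, and you are on a dask/pandas version where those private internals (`_read_pyarrow_parquet_piece`, `_write_partition_pyarrow`, `PyArrowImpl`) still exist — the file and column names here are made up:

```python
import numpy as np
import pandas as pd
from dask import dataframe as dd

# a hypothetical dataframe whose 'signal' column holds a numpy array per cell
df = pd.DataFrame({
    "id": [0, 1, 2],
    "signal": [np.random.rand(16), np.random.rand(16), np.random.rand(16)],
})

ddf = dd.from_pandas(df, npartitions=2)

# with the patches applied, the arrays are transparently serialized to bytes
# on write and deserialized back to np.ndarray on read
ddf.to_parquet("nested_arrays.parquet", engine="pyarrow")
roundtrip = dd.read_parquet("nested_arrays.parquet", engine="pyarrow").compute()

print(type(roundtrip["signal"].iloc[0]))  # expected: <class 'numpy.ndarray'>
```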