
Do xarray or dask really support memory-mapping?


xr.open_dataset with chunks= should not immediately load data into memory; instead it creates a dask.array, which evaluates lazily.

import xarray as xr

testfile = '/Users/mdurant/data/smith_sandwell_topo_v8_2.nc'
arr = xr.open_dataset(testfile, chunks={'latitude': 6336//11, 'longitude': 10800//15}).ROSE
arr

<xarray.DataArray 'ROSE' (latitude: 6336, longitude: 10800)>
dask.array</Users/mdurant/data/smith_sandwell_topo_v8_2.nc:/ROSE, shape=(6336, 10800), dtype=float64, chunksize=(576, 720)>
Coordinates:
  * longitude  (longitude) float32 0.0166667 0.05 0.0833333 0.116667 0.15 ...
  * latitude   (latitude) float32 -72.0009 -71.9905 -71.9802 -71.9699 ...
Attributes:
    long_name: Topography and Bathymetry ( 8123m -> -10799m)
    units: meters
    valid_range: [-32766 32767]
    unpacked_missing_value: -32767.0

(note the dask.array in the above)

Many xarray operations on this are lazy and work chunk-wise (and if you slice, only the required chunks will be loaded).
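You can see the chunk-wise behaviour directly with dask.array. A minimal sketch (the CountingSource wrapper is made up for illustration): it wraps a small numpy array, counts reads, and shows that computing a slice only touches the chunks the slice intersects.

```python
import numpy as np
import dask.array as da

class CountingSource:
    """Array-like wrapper that counts how many reads dask issues against it."""
    def __init__(self, data):
        self.data = data
        self.shape, self.dtype, self.ndim = data.shape, data.dtype, data.ndim
        self.reads = 0
    def __getitem__(self, key):
        self.reads += 1
        return self.data[key]

source = CountingSource(np.arange(100.0).reshape(10, 10))
arr = da.from_array(source, chunks=(5, 5))   # a 2x2 grid of 5x5 chunks
source.reads = 0                             # ignore reads made at graph-build time

total = arr[:5, :5].sum().compute()          # the slice covers only the top-left chunk
print(total)          # 550.0
print(source.reads)   # fewer reads than the 4 chunks in the full array
```

The same mechanism is what makes the netCDF example above cheap: slicing a chunked DataArray only pulls the intersecting chunks off disk.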

arr.sum()

<xarray.DataArray 'ROSE' ()>dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>

arr.sum().values    # evaluates

This is not the same as memory-mapping, however, so I appreciate that this may not answer your question.
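For comparison, plain numpy does support true memory-mapping, where the OS pages data in on demand rather than reading the file into RAM up front. A minimal sketch (the file path and array contents are made up):

```python
import os
import tempfile
import numpy as np

# Save a small array to disk, then reopen it memory-mapped.
path = os.path.join(tempfile.mkdtemp(), "topo.npy")   # hypothetical file
np.save(path, np.arange(12.0).reshape(3, 4))

mm = np.load(path, mmap_mode="r")   # returns a numpy.memmap, not a plain ndarray
print(type(mm).__name__)            # memmap
print(mm[1, 2])                     # 6.0 -- indexing reads only the touched pages
```

np.memmap offers the same thing for raw binary files; neither goes through dask's chunked task graph.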

With dask's threaded scheduler, in-memory values are available to the other workers, so sharing is quite efficient. The distributed scheduler, meanwhile, is quite good at recognising when results can be reused within a computation graph or between graphs.
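Reuse within a graph is easy to demonstrate with dask.delayed. A minimal sketch (the load function and call counter are made up): two outputs depend on the same delayed value, and computing both in one dask.compute call runs the shared task only once.

```python
import dask

calls = {"n": 0}

@dask.delayed
def load():
    """Stand-in for an expensive read; counts how often it actually runs."""
    calls["n"] += 1
    return 42

x = load()        # one delayed node
a = x + 1         # both results depend on the same node
b = x * 2
ra, rb = dask.compute(a, b)   # merged into one graph: load() runs once
print(ra, rb)        # 43 84
print(calls["n"])    # 1
```

Calling a.compute() and b.compute() separately would instead run load() twice, which is why batching work into a single compute (or using the distributed scheduler's caching of in-flight results) pays off.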