how to make a memory efficient multiple dimension groupby/stack using xarray?


Like you, I ran it without issue when inputting a small time series (10,000 points). However, when inputting a 100,000-point time series xr.DataArray, the grp_obj2 for loop ran away and used all of the system's memory.

This is what I used to generate the time series xr.DataArray:

import numpy as np
import xarray as xr

n = 10**5
times = np.datetime64('2000-01-01') + np.arange(n) * np.timedelta64(5, 'm')
data = np.random.randn(n)
time_da = xr.DataArray(data, name='rand_data', dims=('time'),
                       coords={'time': times})
# time_da.to_netcdf('rand_time_series.nc')

As you point out, Dask would be one way to solve this, but I can't see a clear path at the moment... Typically, the Dask approach to this kind of problem would be to:

  1. Make the input a dataset from a file (like NetCDF). This will not load the file into memory but will allow Dask to pull data from disk one chunk at a time.
  2. Define all calculations with dask.delayed or dask.futures methods for the entire body of code, up until writing the output. This is what allows Dask to read, process, and write one small chunk of data at a time.
  3. Calculate one chunk of work and immediately write the output to a new dataset file. Effectively, you end up streaming one chunk of input to one chunk of output at a time (while also being threaded/parallelized).
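The steps above can be sketched in a few lines. This is a minimal illustration, not the solution to the nested groupby itself: it assumes dask is installed, and it chunks an in-memory array for brevity (in practice you would open the file lazily with xr.open_dataset('rand_time_series.nc', chunks={'time': 10_000}) and finish with .to_netcdf(...) so the result streams straight to disk). The variable names `lazy`, `anom`, and `result` are my own:

```python
import numpy as np
import xarray as xr

# Small example series (in practice: open a NetCDF lazily with chunks=...).
n = 10**4
times = np.datetime64('2000-01-01') + np.arange(n) * np.timedelta64(5, 'm')
da = xr.DataArray(np.random.randn(n), dims='time',
                  coords={'time': times}, name='rand_data')

# Steps 1-2: chunk the array so every later operation is lazy (dask-backed).
lazy = da.chunk({'time': 1_000})
anom = lazy - lazy.mean('time')   # defines the work; computes nothing yet

# Step 3: trigger the computation chunk by chunk
# (with .to_netcdf(...) this would stream straight to disk instead).
result = anom.compute()
```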

I tried importing Dask and breaking the input time_da xr.DataArray into chunks for Dask to work on, but it didn't help. From what I can tell, the line stacked_da = xr.concat(s_list, dim=grp1) forces Dask to make a full copy of stacked_da in memory, and much more... One workaround is to write stacked_da to disk and then immediately read it back:

## For group1
xr.concat(s_list, dim=grp1).to_netcdf('stacked_da1.nc')
stacked_da = xr.load_dataset('stacked_da1.nc')
stacked_da[grp1] = grps1

## For group2
xr.concat(s_list, dim=grp2).to_netcdf('stacked_da2.nc')
stacked_da = xr.load_dataset('stacked_da2.nc')
stacked_da[grp2] = grps2

However, the file size for stacked_da1.nc is 19MB, while stacked_da2.nc gets huge at 6.5GB. This is for a time_da with 100,000 elements... so there's clearly something amiss...

Originally, it sounded like you wanted to subtract the mean of the groups from the time-series data. It looks like the xarray docs have an example for that: http://xarray.pydata.org/en/stable/groupby.html#grouped-arithmetic
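If that simpler goal is what you're after, grouped arithmetic alone does it, with no reshaping needed. Here's a minimal sketch following the pattern from that docs page (the variable names `clim` and `anom` are mine): subtracting the grouped means broadcasts them back against the original time axis, so the result keeps the original shape.

```python
import numpy as np
import xarray as xr

n = 10**4
times = np.datetime64('2000-01-01') + np.arange(n) * np.timedelta64(5, 'm')
da = xr.DataArray(np.random.randn(n), dims='time',
                  coords={'time': times}, name='rand_data')

# Monthly means (the "climatology"), one value per month present:
clim = da.groupby('time.month').mean('time')
# Grouped arithmetic: subtract each month's mean from that month's values.
anom = da.groupby('time.month') - clim
```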


The key is to group once, loop over the groups, then group again within each of the groups and append the results to a list.

Next, I concat and use pd.MultiIndex.from_product for the groups.
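To see why from_product lines up with the concatenated list: it emits every (grp1, grp2) pair with the first factor varying slowest, which is exactly the order the nested loops visit the groups (outer loop over grp1, inner over grp2). Note this implicitly assumes every grp1 group contains the same set of grp2 keys. A toy example with made-up hour/month values:

```python
import pandas as pd

# Illustrative group keys only; the real ones come from the groupby objects.
hours = [0, 1, 2]
months = [1, 2]

# Pairs come out in nested-loop order: hour varies slowest, month fastest.
mindex = pd.MultiIndex.from_product([hours, months], names=['hour', 'month'])
# → (0, 1), (0, 2), (1, 1), (1, 2), (2, 1), (2, 2)
```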

No memory problems, no Dask needed, and it only takes a few seconds to run.

Here's the code, enjoy:

def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='month',
                      plot=True):
    """Takes a time-series xr.DataArray object and reshapes it using
    grp1 and grp2. Output is an xr.Dataset that includes the reshaped
    DataArray, its datetime series and the grps. Plots the mean also."""
    import xarray as xr
    import pandas as pd
    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs
    # drop all NaNs:
    time_da = time_da.dropna(time_dim)
    # first grouping:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    da_list = []
    t_list = []
    for grp1_name, grp1_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp1_inds})
        # second grouping:
        grp_obj2 = da.groupby(time_dim + '.' + grp2)
        for grp2_name, grp2_inds in grp_obj2.groups.items():
            da2 = da.isel({time_dim: grp2_inds})
            # extract datetimes and rewrite time coord to 'rest':
            times = da2[time_dim]
            times = times.rename({time_dim: 'rest'})
            times.coords['rest'] = range(len(times))
            t_list.append(times)
            da2 = da2.rename({time_dim: 'rest'})
            da2.coords['rest'] = range(len(da2))
            da_list.append(da2)
    # get group keys:
    grps1 = [x for x in grp_obj1.groups.keys()]
    grps2 = [x for x in grp_obj2.groups.keys()]
    # concat and convert to dataset:
    stacked_ds = xr.concat(da_list, dim='all').to_dataset(name=name)
    stacked_ds[time_dim] = xr.concat(t_list, 'all')
    # create a multiindex for the groups:
    mindex = pd.MultiIndex.from_product([grps1, grps2], names=[grp1, grp2])
    stacked_ds.coords['all'] = mindex
    # unstack:
    ds = stacked_ds.unstack('all')
    ds.attrs = attrs
    return ds