Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array


Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.

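    import numpy as np

    # create an empty array of NaN of the right dimensions
    # (in Python 3, map() returns an iterator, so build a tuple explicitly)
    shape = tuple(map(len, frame.index.levels))
    arr = np.full(shape, np.nan)

    # fill it using Numpy's advanced indexing
    # (the per-level code arrays must be passed as a tuple, one per dimension)
    arr[tuple(frame.index.codes)] = frame.values.flat
    # ...or in Pandas < 0.24.0, use
    # arr[tuple(frame.index.labels)] = frame.values.flat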
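For reference, here is a minimal self-contained sketch of the same idea wrapped in a function. The helper name `multiindex_to_ndarray` and its `column` parameter are made up for illustration. It assumes the index has no duplicate entries, and it calls `remove_unused_levels()` first so that level values that no longer occur in the data do not produce all-NaN slices.

```python
import numpy as np
import pandas as pd


def multiindex_to_ndarray(frame, column="value"):
    """Scatter one column of a MultiIndex-ed frame into a dense n-D array.

    Hypothetical helper; assumes the index contains no duplicate entries.
    """
    # Drop level values that no longer occur, so the shape matches the data.
    index = frame.index.remove_unused_levels()
    shape = tuple(len(level) for level in index.levels)
    arr = np.full(shape, np.nan)
    # index.codes holds one integer array per level; passed as a tuple it
    # acts as NumPy advanced indexing and writes each row to its n-D cell.
    arr[tuple(index.codes)] = frame[column].to_numpy()
    return arr


# Same 3-D setup as in the answer: a 2 x 2 x 2 grid with one row removed.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": np.arange(8)}, index=index).drop((1, 0, 1))

arr = multiindex_to_ndarray(frame)
print(arr.shape)               # (2, 2, 2)
print(np.isnan(arr[1, 0, 1]))  # True: the dropped entry stays NaN
```

Cells with no matching row keep the NaN fill value, so the result is always a float array.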

Original solution. Given a setup similar to above, but in 3-D,

    from pandas import DataFrame, MultiIndex
    from itertools import product

    index = range(2), range(2), range(2)
    value = range(2 * 2 * 2)
    frame = DataFrame(value, columns=['value'],
                      index=MultiIndex.from_product(index)).drop((1, 0, 1))
    print(frame)

we have

           value
    0 0 0      0
        1      1
      1 0      2
        1      3
    1 0 0      4
      1 0      6
        1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full Cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be slow and can consume a lot of memory, depending on the number of dimensions and on the size of the data frame.

    levels = map(tuple, frame.index.levels)
    index = list(product(*levels))
    frame = frame.reindex(index)
    print(frame)

which outputs

           value
    0 0 0      0
        1      1
      1 0      2
        1      3
    1 0 0      4
        1    NaN
      1 0      6
        1      7

Now, reshape() will work as intended.

    # reshape() needs a real sequence; in Python 3, map() returns an iterator
    shape = tuple(map(len, frame.index.levels))
    print(frame.values.reshape(shape))

which outputs

    [[[  0.   1.]
      [  2.   3.]]

     [[  4.  nan]
      [  6.   7.]]]

The (rather ugly) one-liner is

    frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
         .reshape(tuple(map(len, frame.index.levels)))
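As a sanity check, the sketch below runs both routes on the same 3-D example and confirms they give the same array. It makes one substitution: the full index is built with `MultiIndex.from_product` rather than `itertools.product`, which should be equivalent here. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same 3-D setup as in the answer: a 2 x 2 x 2 grid with one row removed.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": np.arange(8)}, index=index).drop((1, 0, 1))

shape = tuple(len(level) for level in frame.index.levels)

# Route 1: reindex to the full Cartesian product of the levels, then reshape.
full_index = pd.MultiIndex.from_product(frame.index.levels)
via_reshape = frame.reindex(full_index)["value"].to_numpy().reshape(shape)

# Route 2: allocate a NaN array and scatter the values with advanced indexing.
via_indexing = np.full(shape, np.nan)
via_indexing[tuple(frame.index.codes)] = frame["value"].to_numpy()

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(via_reshape, via_indexing)
print(via_reshape.shape)
```

Route 1 materializes the whole reindexed frame before reshaping, whereas route 2 only allocates the target array, which fits the speed difference mentioned at the top of the answer.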


This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.

If you have a Series with a MultiIndex, you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This generates a DataArray object, which is essentially a name-indexed NumPy array, using the index values and names as coordinates. (Called on a DataFrame, to_xarray() returns a Dataset holding one DataArray per column instead.) You can then call .values on the DataArray object to get the underlying NumPy array.

If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
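A minimal sketch of this route, assuming the xarray package is installed. The level names `x`, `y`, `z` and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2 x 2 x 2 Series with named index levels and one entry removed.
index = pd.MultiIndex.from_product(
    [range(2), range(2), range(2)], names=["x", "y", "z"]
)
series = pd.Series(np.arange(8.0), index=index, name="value").drop((1, 0, 1))

# Series.to_xarray() needs xarray installed; the level names become the
# dimension names and the level values become the coordinates.
data_array = series.to_xarray()
print(data_array.dims)

# The underlying NumPy array; the missing (1, 0, 1) entry is NaN.
arr = data_array.values
print(arr.shape)

# Reorder the coordinates along one dimension by passing them in the
# desired order.
flipped = data_array.reindex(z=[1, 0])
print(flipped.values[0, 0])
```

Because reindex() takes the dimension name as a keyword, the reordering stays readable even when the array has many dimensions.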