Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

python pandas multidimensional-array multi-index

Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.

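    import numpy as np

    # create an empty array of NaN of the right dimensions
    # (np.full needs a real tuple; in Python 3, map() returns an iterator)
    shape = tuple(map(len, frame.index.levels))
    arr = np.full(shape, np.nan)

    # fill it using Numpy's advanced indexing
    # (the index must be a tuple holding one array of codes per level)
    arr[tuple(frame.index.codes)] = frame.values.flat

    # ...or in Pandas < 0.24.0, use
    # arr[tuple(frame.index.labels)] = frame.values.flat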
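The snippet above relies on a `frame` that is only defined further down. As a rough, self-contained sketch of the same idea (assuming a single `value` column and the 2 x 2 x 2 setup from the original solution, with one row dropped), it could look like this:

```python
import numpy as np
import pandas as pd

# Same 3-D setup as in the original solution: 8 combinations, one removed.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": np.arange(8)}, index=index).drop((1, 0, 1))

# One axis per index level, sized by the number of labels in that level.
shape = tuple(len(level) for level in frame.index.levels)
arr = np.full(shape, np.nan)

# index.codes holds, for every row, the integer position of its label in
# each level; as a tuple of arrays it addresses one cell of arr per row.
arr[tuple(frame.index.codes)] = frame["value"].to_numpy()

print(arr.shape)               # (2, 2, 2)
print(np.isnan(arr[1, 0, 1]))  # True: the dropped combination stays NaN
```

Because the codes are positions within `frame.index.levels`, the rows do not need to be sorted, and missing combinations simply keep their initial NaN.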

Original solution. Given a setup similar to above, but in 3-D,

    from pandas import DataFrame, MultiIndex
    from itertools import product

    index = range(2), range(2), range(2)
    value = range(2 * 2 * 2)
    frame = DataFrame(value, columns=['value'],
                      index=MultiIndex.from_product(index)).drop((1, 0, 1))
    print(frame)

we have

           value
    0 0 0      0
        1      1
      1 0      2
        1      3
    1 0 0      4
      1 0      6
        1      7

Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.

First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be slow and can consume a lot of memory, depending on the number of dimensions and on the size of the data frame.

    levels = map(tuple, frame.index.levels)
    index = list(product(*levels))
    frame = frame.reindex(index)
    print(frame)

which outputs

           value
    0 0 0      0
        1      1
      1 0      2
        1      3
    1 0 0      4
        1    NaN
      1 0      6
        1      7
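As a possible variation on the reindexing step, the full cartesian product can also be built by pandas itself with `MultiIndex.from_product`, which avoids materialising a Python list of tuples. This is only a sketch under the same 2 x 2 x 2 assumptions as above; the names `full_index` and `full` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same 3-D setup: 8 combinations, with (1, 0, 1) removed.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)])
frame = pd.DataFrame({"value": np.arange(8)}, index=index).drop((1, 0, 1))

# Build the complete cartesian product directly from the existing levels.
full_index = pd.MultiIndex.from_product(frame.index.levels,
                                        names=frame.index.names)

# Reindexing inserts NaN for every combination missing from the frame.
full = frame.reindex(full_index)

print(len(frame), len(full))       # 7 8
print(full["value"].isna().sum())  # 1
```

The reindexed frame still holds one row per possible combination, so the memory caveat mentioned above applies here as well.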

Now, reshape() will work as intended.

    # reshape() needs a tuple; in Python 3, map() returns an iterator
    shape = tuple(map(len, frame.index.levels))
    print(frame.values.reshape(shape))

which outputs

    [[[  0.   1.]
      [  2.   3.]]

     [[  4.  nan]
      [  6.   7.]]]

The (rather ugly) one-liner is

    frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
         .reshape(tuple(map(len, frame.index.levels)))

This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.

If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.

If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
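A minimal sketch of the xarray route described above, assuming the `xarray` package is installed and using illustrative level names `x`, `y` and `z` (they are not from the original question):

```python
import numpy as np
import pandas as pd

# A multi-index Series with named levels; one combination is missing.
index = pd.MultiIndex.from_product([range(2), range(2), range(2)],
                                   names=["x", "y", "z"])
series = pd.Series(np.arange(8.0), index=index, name="value").drop((1, 0, 1))

# to_xarray() needs xarray installed; it returns a DataArray with one
# dimension per index level and NaN where a combination is absent.
da = series.to_xarray()
arr = da.values  # the underlying NumPy array

# Put the labels of dimension "x" in a specific order.
da_reordered = da.reindex(x=[1, 0])

print(da.dims)    # ('x', 'y', 'z')
print(arr.shape)  # (2, 2, 2)
```

Here `da_reordered.values[0]` holds the slice that was at `x == 1` in the original array, which is handy when the tensor has to line up with an externally defined key order.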
