How to create pandas dataframes with more than 2 dimensions?


Rather than using an n-dimensional Panel, you are probably better off with a two-dimensional representation of the data that uses a MultiIndex for the index, the columns, or both.

For example:

    import numpy as np
    import pandas as pd

    np.random.seed(1618033)

    # Set 3 axis labels/dims
    years = np.arange(2000, 2010)  # Years
    samples = np.arange(0, 20)  # Samples
    patients = np.array(["patient_%d" % i for i in range(0, 3)])  # Patients

    # Create random 3D array to simulate data from dims above
    A_3D = np.random.random((years.size, samples.size, len(patients)))  # (10, 20, 3)

    # Create the MultiIndex from years, samples and patients.
    midx = pd.MultiIndex.from_product([years, samples, patients])

    # Create sample data for each patient, and add the MultiIndex.
    patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index=midx)

    >>> patient_data.head()
                             0         1         2
    2000 0 patient_0 -0.128005  0.371413 -0.078591
           patient_1 -0.378728 -2.003226 -0.024424
           patient_2  1.339083  0.408708  1.724094
         1 patient_0 -0.997879 -0.251789 -0.976275
           patient_1  0.131380 -0.901092  1.456144

Once you have data in this form, it is relatively easy to juggle it around. For example:

    >>> patient_data.unstack(level=0).head()  # Years.
                        0                                                                                              ...            2
                     2000      2001      2002      2003      2004      2005      2006      2007      2008      2009    ...         2000      2001      2002      2003      2004      2005      2006      2007      2008      2009
    0 patient_0 -0.128005  0.051558  1.251120  0.666061 -1.048103  0.259231  1.535370  0.156281 -0.609149  0.360219    ...    -0.078591 -2.305314 -2.253770  0.865997  0.458720  1.479144 -0.214834 -0.791904  0.800452  0.235016
      patient_1 -0.378728 -0.117470 -0.306892  0.810256  2.702960 -0.748132 -1.449984 -0.195038  1.151445  0.301487    ...    -0.024424  0.114843  0.143700  1.732072  0.602326  1.465946 -1.215020  0.648420  0.844932 -1.261558
      patient_2  1.339083 -0.915771  0.246077  0.820608 -0.935617 -0.449514 -1.105256 -0.051772 -0.671971  0.213349    ...     1.724094  0.835418  0.000819  1.149556 -0.318513 -0.450519 -0.694412 -1.535343  1.035295  0.627757
    1 patient_0 -0.997879 -0.242597  1.028464  2.093807  1.380361  0.691210 -2.420800  1.593001  0.925579  0.540447    ...    -0.976275  1.928454 -0.626332 -0.049824 -0.912860  0.225834  0.277991  0.326982 -0.520260  0.788685
      patient_1  0.131380  0.398155 -1.671873 -1.329554 -0.298208 -0.525148  0.897745 -0.125233 -0.450068 -0.688240    ...     1.456144 -0.503815 -1.329334  0.475751 -0.201466  0.604806 -0.640869 -1.381123  0.524899  0.041983

To select data from a frame like this, refer to the pandas docs on MultiIndex indexing.
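
As a rough sketch of what such selections can look like, here are a few common patterns. The frame is rebuilt so the block runs on its own, and the level names `year`, `sample` and `patient` are an assumption added here for readability; they are not part of the example above.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(0, 3)])

# Level names are an assumption added for this illustration.
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)
patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index=midx)

# All rows for one year: a partial key on the outermost level.
year_2005 = patient_data.loc[2005]

# One exact row: a full tuple key returns a Series over the 3 columns.
one_row = patient_data.loc[(2005, 3, "patient_1")]

# A cross-section on an inner level; the remaining levels are kept.
patient_0 = patient_data.xs("patient_0", level="patient")

# Slice several levels at once with IndexSlice. This needs a sorted index,
# which from_product on sorted inputs already provides.
idx = pd.IndexSlice
subset = patient_data.loc[idx[2001:2003, 0:4, :], :]
```

Label slices in `.loc` include both endpoints, so `2001:2003` covers three years and `0:4` covers five samples.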


An alternative to Alexander's approach, derived directly from the structure of the input data, is:

    import numpy as np
    import pandas as pd

    np.random.seed(1618033)

    # Set 3 axis labels/dims
    years = np.arange(2000, 2010)  # Years
    samples = np.arange(0, 20)  # Samples
    patients = np.array(["patient_%d" % i for i in range(0, 3)])  # Patients

    # Create random 3D array to simulate data from dims above
    A_3D = np.random.random((years.size, samples.size, len(patients)))  # (10, 20, 3)

    # Reshape data to 2 dimensions
    maj_dim = 1
    for dim in A_3D.shape[:-1]:
        maj_dim = maj_dim * dim
    new_dims = (maj_dim, A_3D.shape[-1])
    A_3D = A_3D.reshape(new_dims)

    # Create the MultiIndex from years and samples (patients become the columns).
    # Note that Cartesian product order is the same as the
    # C-order used by default in ``reshape``.
    midx = pd.MultiIndex.from_product([years, samples])

    # Build the DataFrame: one column per patient, indexed by the MultiIndex.
    patient_data = pd.DataFrame(data=A_3D, index=midx, columns=patients)

    >>> patient_data.head()
            patient_0  patient_1  patient_2
    2000 0   0.727753   0.154701   0.205916
         1   0.796355   0.597207   0.897153
         2   0.603955   0.469707   0.580368
         3   0.365432   0.852758   0.293725
         4   0.906906   0.355509   0.994513
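
The loop that computes `maj_dim` can also be expressed with `reshape(-1, n)`, and the claim that the Cartesian product order matches NumPy's C-order can be checked directly. The following is only a sketch under the same setup as above; `original`, `flat`, `matches` and `all_match` are names introduced here for the comparison.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(0, 3)])

# Keep the 3D array around so it can be compared against the DataFrame.
original = np.random.random((years.size, samples.size, len(patients)))

# -1 lets NumPy infer the product of the leading dimensions (10 * 20 = 200).
flat = original.reshape(-1, original.shape[-1])

midx = pd.MultiIndex.from_product([years, samples])
patient_data = pd.DataFrame(flat, index=midx, columns=patients)

# Spot-check: the row labelled (2003, 7) should hold original[3, 7, :].
value = patient_data.loc[(2003, 7), "patient_1"]
matches = bool(value == original[3, 7, 1])

# Full check: reshaping the values back recovers the 3D array.
round_trip = patient_data.to_numpy().reshape(original.shape)
all_match = bool(np.array_equal(round_trip, original))
```

If either check came out false, the index labels would not line up with the data, so a check like this is worth keeping whenever the axis order of the input array changes.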


You should consider using xarray instead. From their documentation:

Panel, pandas’ data structure for 3D arrays, was always a second class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas removed Panel in favor of directing users who use multi-dimensional arrays to xarray.
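
For comparison, here is a minimal sketch of the same simulated data held in an `xarray.DataArray`. It assumes xarray is installed, and the dimension names `year`, `sample` and `patient` are chosen here for illustration only.

```python
import numpy as np
import xarray as xr

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = ["patient_%d" % i for i in range(0, 3)]

A_3D = np.random.random((len(years), len(samples), len(patients)))

# Dimension names are assumptions chosen for this example.
da = xr.DataArray(
    A_3D,
    coords={"year": years, "sample": samples, "patient": patients},
    dims=["year", "sample", "patient"],
)

# Label-based selection along named dimensions.
one_year = da.sel(year=2005)
one_value = da.sel(year=2005, sample=3, patient="patient_1")

# Convert to a pandas Series with a 3-level MultiIndex when needed.
as_series = da.to_series()
```

The array stays genuinely three-dimensional here, and `to_series()` gives a route back to the MultiIndex representation when pandas tooling is needed.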