How to create pandas dataframes with more than 2 dimensions?
Rather than using an n-dimensional Panel, you are probably better off with a two-dimensional representation of the data that uses a MultiIndex for the index, the columns, or both.
For example:
    import numpy as np
    import pandas as pd

    np.random.seed(1618033)

    # Set 3 axis labels/dims
    years = np.arange(2000, 2010)  # Years
    samples = np.arange(0, 20)  # Samples
    patients = np.array(["patient_%d" % i for i in range(0, 3)])  # Patients

    # Create random 3D array to simulate data from dims above
    # (not used below, but drawing it advances the random state)
    A_3D = np.random.random((years.size, samples.size, len(patients)))  # (10, 20, 3)

    # Create the MultiIndex from years, samples and patients.
    midx = pd.MultiIndex.from_product([years, samples, patients])

    # Create sample data for each patient, and add the MultiIndex.
    patient_data = pd.DataFrame(np.random.randn(len(midx), 3), index=midx)

    >>> patient_data.head()
                             0         1         2
    2000 0 patient_0 -0.128005  0.371413 -0.078591
           patient_1 -0.378728 -2.003226 -0.024424
           patient_2  1.339083  0.408708  1.724094
         1 patient_0 -0.997879 -0.251789 -0.976275
           patient_1  0.131380 -0.901092  1.456144
Once you have data in this form, it is relatively easy to juggle it around. For example:
    >>> patient_data.unstack(level=0).head()  # Years.
                        0                                                                                            ...         2
                     2000      2001      2002      2003      2004      2005      2006      2007      2008      2009  ...      2000      2001      2002      2003      2004      2005      2006      2007      2008      2009
    0 patient_0 -0.128005  0.051558  1.251120  0.666061 -1.048103  0.259231  1.535370  0.156281 -0.609149  0.360219  ... -0.078591 -2.305314 -2.253770  0.865997  0.458720  1.479144 -0.214834 -0.791904  0.800452  0.235016
      patient_1 -0.378728 -0.117470 -0.306892  0.810256  2.702960 -0.748132 -1.449984 -0.195038  1.151445  0.301487  ... -0.024424  0.114843  0.143700  1.732072  0.602326  1.465946 -1.215020  0.648420  0.844932 -1.261558
      patient_2  1.339083 -0.915771  0.246077  0.820608 -0.935617 -0.449514 -1.105256 -0.051772 -0.671971  0.213349  ...  1.724094  0.835418  0.000819  1.149556 -0.318513 -0.450519 -0.694412 -1.535343  1.035295  0.627757
    1 patient_0 -0.997879 -0.242597  1.028464  2.093807  1.380361  0.691210 -2.420800  1.593001  0.925579  0.540447  ... -0.976275  1.928454 -0.626332 -0.049824 -0.912860  0.225834  0.277991  0.326982 -0.520260  0.788685
      patient_1  0.131380  0.398155 -1.671873 -1.329554 -0.298208 -0.525148  0.897745 -0.125233 -0.450068 -0.688240  ...  1.456144 -0.503815 -1.329334  0.475751 -0.201466  0.604806 -0.640869 -1.381123  0.524899  0.041983
To select data, please refer to the pandas documentation on MultiIndex (hierarchical) indexing.
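As a rough illustration of what that selection can look like, here is a minimal sketch on a smaller stand-in frame. The level names `year`, `sample` and `patient` and the variable names are assumptions added for readability; they are not part of the example above.

```python
import numpy as np
import pandas as pd

# Small stand-in for patient_data; the level names are an assumption.
years = np.arange(2000, 2003)
samples = np.arange(0, 4)
patients = ["patient_%d" % i for i in range(3)]
midx = pd.MultiIndex.from_product(
    [years, samples, patients], names=["year", "sample", "patient"]
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((len(midx), 3)), index=midx)

# All rows for one year; the selected level is dropped from the result.
year_2001 = df.loc[2001]

# A single row, addressed by a full (year, sample, patient) key.
one_row = df.loc[(2001, 2, "patient_1")]

# Cross-section on an inner level, keeping the remaining levels.
patient_1 = df.xs("patient_1", level="patient")

# Slice several levels at once with IndexSlice. This needs a sorted index,
# which from_product yields when its inputs are sorted.
idx = pd.IndexSlice
subset = df.loc[idx[2000:2001, 1:2, :], :]
```

Label slices such as `2000:2001` include both endpoints, unlike positional slices.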
An alternative approach (to Alexander's answer), derived from the structure of the input data, is:
    import numpy as np
    import pandas as pd

    np.random.seed(1618033)

    # Set 3 axis labels/dims
    years = np.arange(2000, 2010)  # Years
    samples = np.arange(0, 20)  # Samples
    patients = np.array(["patient_%d" % i for i in range(0, 3)])  # Patients

    # Create random 3D array to simulate data from dims above
    A_3D = np.random.random((years.size, samples.size, len(patients)))  # (10, 20, 3)

    # Reshape data to 2 dimensions: (years * samples, patients)
    maj_dim = 1
    for dim in A_3D.shape[:-1]:
        maj_dim = maj_dim * dim
    new_dims = (maj_dim, A_3D.shape[-1])
    A_3D = A_3D.reshape(new_dims)

    # Create the MultiIndex from years and samples.
    # Note that Cartesian product order is the same as the
    # C-order used by default in ``reshape``.
    midx = pd.MultiIndex.from_product([years, samples])

    # Build the frame: one column per patient, indexed by the MultiIndex.
    patient_data = pd.DataFrame(data=A_3D, index=midx, columns=patients)

    >>> patient_data.head()
            patient_0  patient_1  patient_2
    2000 0   0.727753   0.154701   0.205916
         1   0.796355   0.597207   0.897153
         2   0.603955   0.469707   0.580368
         3   0.365432   0.852758   0.293725
         4   0.906906   0.355509   0.994513
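The claim that the Cartesian product order matches the C-order used by `reshape` can be checked directly. The sketch below rebuilds the same data, uses `reshape(-1, n)` as a shorthand for the manual dimension loop, and round-trips the frame back to 3D. The names `A_2D` and `restored` and the index level names are assumptions introduced for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = np.array(["patient_%d" % i for i in range(0, 3)])
A_3D = np.random.random((years.size, samples.size, len(patients)))

# reshape(-1, n) lets NumPy infer the product of the leading dimensions.
A_2D = A_3D.reshape(-1, A_3D.shape[-1])
midx = pd.MultiIndex.from_product([years, samples], names=["year", "sample"])
patient_data = pd.DataFrame(A_2D, index=midx, columns=patients)

# Spot check: row (2003, 7), column patient_2 is element [3, 7, 2].
assert patient_data.loc[(2003, 7), "patient_2"] == A_3D[3, 7, 2]

# Round trip back to the original 3D array.
restored = patient_data.to_numpy().reshape(
    years.size, samples.size, len(patients)
)
assert np.array_equal(restored, A_3D)
```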
You should consider using xarray instead. From their documentation:
Panel, pandas’ data structure for 3D arrays, was always a second class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas removed Panel in favor of directing users who use multi-dimensional arrays to xarray.
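For comparison, here is a minimal sketch of the same simulated data held in an `xarray.DataArray`, assuming xarray is installed. The dimension names `year`, `sample` and `patient`, the array name `value`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import xarray as xr

np.random.seed(1618033)
years = np.arange(2000, 2010)
samples = np.arange(0, 20)
patients = ["patient_%d" % i for i in range(3)]
A_3D = np.random.random((years.size, samples.size, len(patients)))

# Attach labels to each axis of the 3D array.
da = xr.DataArray(
    A_3D,
    dims=("year", "sample", "patient"),
    coords={"year": years, "sample": samples, "patient": patients},
    name="value",
)

# Label-based selection along named dimensions.
one_patient_year = da.sel(year=2005, patient="patient_1")

# Reduce over a named dimension instead of an axis number.
yearly_mean = da.mean(dim="sample")

# Convert back to pandas: a (year, sample) MultiIndex with patient columns.
as_frame = da.to_series().unstack("patient")
```

The last step gives a frame with the same layout as the MultiIndex approach, so the two representations can be mixed as needed.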