Pandas column of lists, create a row for each list element
A bit longer than I expected:
>>> df samples subject trial_num0 [-0.07, -2.9, -2.44] 1 11 [-1.52, -0.35, 0.1] 1 22 [-0.17, 0.57, -0.65] 1 33 [-0.82, -1.06, 0.47] 2 14 [0.79, 1.35, -0.09] 2 25 [1.17, 1.14, -1.79] 2 3>>>>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)>>> s.name = 'sample'>>>>>> df.drop('samples', axis=1).join(s) subject trial_num sample0 1 1 -0.070 1 1 -2.900 1 1 -2.441 1 2 -1.521 1 2 -0.351 1 2 0.102 1 3 -0.172 1 3 0.572 1 3 -0.653 2 1 -0.823 2 1 -1.063 2 1 0.474 2 2 0.794 2 2 1.354 2 2 -0.095 2 3 1.175 2 3 1.145 2 3 -1.79
If you want sequential index, you can apply reset_index(drop=True)
to the result.
update:
>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()>>> res = res.reset_index()>>> res.columns = ['subject','trial_num','sample_num','sample']>>> res subject trial_num sample_num sample0 1 1 0 1.891 1 1 1 -2.922 1 1 2 0.343 1 2 0 0.854 1 2 1 0.245 1 2 2 0.726 1 3 0 -0.967 1 3 1 -2.728 1 3 2 -0.119 2 1 0 -1.3310 2 1 1 3.1311 2 1 2 -0.6512 2 2 0 0.1013 2 2 1 0.6514 2 2 2 0.1515 2 3 0 0.6416 2 3 1 -0.1017 2 3 2 -0.76
Pandas >= 0.25
Series and DataFrame methods define a .explode()
method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
df = pd.DataFrame({ 'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan], 'var2': [1, 2, 3, 4]})df var1 var20 [a, b, c] 11 [d, e] 22 [] 33 NaN 4df.explode('var1') var1 var20 a 10 b 10 c 11 d 21 e 22 NaN 3 # empty list converted to NaN3 NaN 4 # NaN entry preserved as-is# to reset the index to be monotonically increasing...df.explode('var1').reset_index(drop=True) var1 var20 a 11 b 12 c 13 d 24 e 25 NaN 36 NaN 4
Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat
-based solutions).
However, you should note that explode
only works on a single column (for now).
P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode
. See this (very much) related answer by me.
UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode()
.
lst_col = 'samples'r = pd.DataFrame({ col:np.repeat(df[col].values, df[lst_col].str.len()) for col in df.columns.drop(lst_col)} ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
Result:
In [103]: rOut[103]: samples subject trial_num0 0.10 1 11 -0.20 1 12 0.05 1 13 0.25 1 24 1.32 1 25 -0.17 1 26 0.64 1 37 -0.22 1 38 -0.71 1 39 -0.03 2 110 -0.65 2 111 0.76 2 112 1.77 2 213 0.89 2 214 0.65 2 215 -0.98 2 316 0.65 2 317 -0.30 2 3
PS here you may find a bit more generic solution
UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:
in the following line we are repeating values in one column N
times where N
- is the length of the corresponding list:
In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)
this can be generalized for all columns, containing scalar values:
In [11]: pd.DataFrame({ ...: col:np.repeat(df[col].values, df[lst_col].str.len()) ...: for col in df.columns.drop(lst_col)} ...: )Out[11]: trial_num subject0 1 11 1 12 1 13 2 14 2 15 2 16 3 1.. ... ...11 1 212 2 213 2 214 2 215 3 216 3 217 3 2[18 rows x 2 columns]
using np.concatenate()
we can flatten all values in the list
column (samples
) and get a 1D vector:
In [12]: np.concatenate(df[lst_col].values)Out[12]: array([-1.04, -0.58, -1.32, 0.82, -0.59, -0.34, 0.25, 2.09, 0.12, 0.83, -0.88, 0.68, 0.55, -0.56, 0.65, -0.04, 0.36, -0.31])
putting all this together:
In [13]: pd.DataFrame({ ...: col:np.repeat(df[col].values, df[lst_col].str.len()) ...: for col in df.columns.drop(lst_col)} ...: ).assign(**{lst_col:np.concatenate(df[lst_col].values)})Out[13]: trial_num subject samples0 1 1 -1.041 1 1 -0.582 1 1 -1.323 2 1 0.824 2 1 -0.595 2 1 -0.346 3 1 0.25.. ... ... ...11 1 2 0.6812 2 2 0.5513 2 2 -0.5614 2 2 0.6515 3 2 -0.0416 3 2 0.3617 3 2 -0.31[18 rows x 3 columns]
using pd.DataFrame()[df.columns]
will guarantee that we are selecting columns in the original order...