Pandas column of lists, create a row for each list element Pandas column of lists, create a row for each list element pandas pandas

Pandas column of lists, create a row for each list element


A bit longer than I expected:

>>> df                samples  subject  trial_num0  [-0.07, -2.9, -2.44]        1          11   [-1.52, -0.35, 0.1]        1          22  [-0.17, 0.57, -0.65]        1          33  [-0.82, -1.06, 0.47]        2          14   [0.79, 1.35, -0.09]        2          25   [1.17, 1.14, -1.79]        2          3>>>>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)>>> s.name = 'sample'>>>>>> df.drop('samples', axis=1).join(s)   subject  trial_num  sample0        1          1   -0.070        1          1   -2.900        1          1   -2.441        1          2   -1.521        1          2   -0.351        1          2    0.102        1          3   -0.172        1          3    0.572        1          3   -0.653        2          1   -0.823        2          1   -1.063        2          1    0.474        2          2    0.794        2          2    1.354        2          2   -0.095        2          3    1.175        2          3    1.145        2          3   -1.79

If you want sequential index, you can apply reset_index(drop=True) to the result.

update:

>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()>>> res = res.reset_index()>>> res.columns = ['subject','trial_num','sample_num','sample']>>> res    subject  trial_num  sample_num  sample0         1          1           0    1.891         1          1           1   -2.922         1          1           2    0.343         1          2           0    0.854         1          2           1    0.245         1          2           2    0.726         1          3           0   -0.967         1          3           1   -2.728         1          3           2   -0.119         2          1           0   -1.3310        2          1           1    3.1311        2          1           2   -0.6512        2          2           0    0.1013        2          2           1    0.6514        2          2           2    0.1515        2          3           0    0.6416        2          3           1   -0.1017        2          3           2   -0.76


Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

df = pd.DataFrame({    'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan],     'var2': [1, 2, 3, 4]})df        var1  var20  [a, b, c]     11     [d, e]     22         []     33        NaN     4df.explode('var1')  var1  var20    a     10    b     10    c     11    d     21    e     22  NaN     3  # empty list converted to NaN3  NaN     4  # NaN entry preserved as-is# to reset the index to be monotonically increasing...df.explode('var1').reset_index(drop=True)  var1  var20    a     11    b     12    c     13    d     24    e     25  NaN     36  NaN     4

Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).

However, you should note that explode only works on a single column (for now).

P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.


UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().


lst_col = 'samples'r = pd.DataFrame({      col:np.repeat(df[col].values, df[lst_col].str.len())      for col in df.columns.drop(lst_col)}    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

Result:

In [103]: rOut[103]:    samples  subject  trial_num0      0.10        1          11     -0.20        1          12      0.05        1          13      0.25        1          24      1.32        1          25     -0.17        1          26      0.64        1          37     -0.22        1          38     -0.71        1          39     -0.03        2          110    -0.65        2          111     0.76        2          112     1.77        2          213     0.89        2          214     0.65        2          215    -0.98        2          316     0.65        2          317    -0.30        2          3

PS here you may find a bit more generic solution


UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:

in the following line we are repeating values in one column N times where N - is the length of the corresponding list:

In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

this can be generalized for all columns, containing scalar values:

In [11]: pd.DataFrame({    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())    ...:           for col in df.columns.drop(lst_col)}    ...:         )Out[11]:    trial_num  subject0           1        11           1        12           1        13           2        14           2        15           2        16           3        1..        ...      ...11          1        212          2        213          2        214          2        215          3        216          3        217          3        2[18 rows x 2 columns]

using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:

In [12]: np.concatenate(df[lst_col].values)Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])

putting all this together:

In [13]: pd.DataFrame({    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())    ...:           for col in df.columns.drop(lst_col)}    ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})Out[13]:    trial_num  subject  samples0           1        1    -1.041           1        1    -0.582           1        1    -1.323           2        1     0.824           2        1    -0.595           2        1    -0.346           3        1     0.25..        ...      ...      ...11          1        2     0.6812          2        2     0.5513          2        2    -0.5614          2        2     0.6515          3        2    -0.0416          3        2     0.3617          3        2    -0.31[18 rows x 3 columns]

using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order...