Pandas expand rows from list data available in column


DataFrame.explode

Since pandas 0.25.0 there is a built-in explode method for this. It turns each element of a list into its own row and repeats the values of the other columns:

df.explode('column1').reset_index(drop=True)

Output

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3

Since pandas >= 1.1.0 we have the ignore_index argument, so we don't have to chain with reset_index:

df.explode('column1', ignore_index=True)

Output

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
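For reference, a minimal self-contained sketch of both calls. It assumes the sample frame that the later answers define (three rows, each holding a three-element list); the names chained and direct are only illustrative.

```python
import pandas as pd

# sample frame, assumed to match the one used in the answers below
df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# explode keeps the original row label on every new row (0, 0, 0, 1, 1, 1, ...),
# so the index is rebuilt afterwards
chained = df.explode('column1').reset_index(drop=True)

# same result in one call; requires pandas >= 1.1.0
direct = df.explode('column1', ignore_index=True)

print(direct)
```

Both variants leave the original df untouched and return a new frame.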


You can also build a DataFrame with the constructor and then stack it:

df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='column1')[['column1','column2']])
print (df2)
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3

Since the subset [['column1','column2']] already selects and orders the columns, you can also omit the first reset_index:

df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(name='column1')[['column1','column2']])
print (df2)
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
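To see why the constructor-and-stack approach works, it can help to look at the intermediate objects. The sketch below assumes the same sample frame (redefined here so the block runs on its own); the names wide and stacked are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# one column per list position (0, 1, 2), indexed by the values of column2
wide = pd.DataFrame(df.column1.tolist(), index=df.column2)

# stack moves the list positions into a second index level: (column2, position)
stacked = wide.stack()

# drop the position level, then turn column2 back into a regular column
df2 = (stacked.reset_index(level=1, drop=True)
              .reset_index(name='column1')[['column1', 'column2']])
print(df2)
```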

Another solution uses DataFrame.from_records to create a DataFrame from the first column, then builds a Series with stack and joins it to the original DataFrame:

df = pd.DataFrame({'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
                   'column2':[1,2,3]})

a = (pd.DataFrame.from_records(df.column1.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('column1'))
print (a)
0    a
0    b
0    c
1    d
1    e
1    f
2    g
2    h
2    i
Name: column1, dtype: object

print (df.drop('column1', axis=1)
         .join(a)
         .reset_index(drop=True)[['column1','column2']])
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3


Another solution is to use the result_type='expand' argument of DataFrame.apply, available since pandas 0.23. In answer to @splinter's question, this method can be generalized to frames with more columns, as the second snippet shows:

import pandas as pd
from numpy import arange

df = pd.DataFrame(
    {'column1' : [['a','b','c'],['d','e','f'],['g','h','i']],
    'column2': [1,2,3]})

# one new column (0, 1, 2) per list position
expanded = df.apply(lambda row: row['column1'], axis=1, result_type='expand')

# drop the list column before melting: recent pandas versions raise a ValueError
# if value_name matches an existing column; column2 is kept through id_vars
pd.melt(
    df.drop(columns=['column1']).join(expanded),
    id_vars=['column2'], value_vars=arange(expanded.shape[1]), value_name='column1'
    )[['column1','column2']]

# can be generalized
df = pd.DataFrame(
    {'column1' : [['a','b','c'],['d','e','f'],['g','h','i']],
    'column2': [1,2,3],
    'column3': [[1,2],[2,3],[3,4]],
    'column4': [42,23,321],
    'column5': ['a','b','c']})

expanded = df.apply(lambda row: row['column1'], axis=1, result_type='expand')

(pd.melt(
    df.drop(columns=['column1']).join(expanded),
    id_vars=df.columns[1:], value_vars=arange(expanded.shape[1]), value_name='column1')
 .drop(columns=['variable'])[list(df.columns)]
 .sort_values(by=['column1']))
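To make the role of result_type='expand' concrete, this small sketch shows what apply returns on its own, before any melting takes place. It uses the same sample data; the variable name expanded is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# with result_type='expand', the list returned for each row is spread
# over separate columns labelled 0, 1, 2
expanded = df.apply(lambda row: row['column1'], axis=1, result_type='expand')

print(expanded)
```

The result has one row per original row and one column per list position, which is the wide shape that melt then turns back into a long one.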

UPDATE (for Jwely's comment): if you have lists of varying length, pad the shorter ones with None before expanding and drop the padding afterwards:

df = pd.DataFrame(
    {'column1' : [['a','b','c'],['d','f'],['g','h','i']],
    'column2': [1,2,3]})

longest = max(df['column1'].apply(lambda x: len(x)))

# pad every list to the same length so each row expands to `longest` columns
expanded = df.apply(
    lambda row: row['column1'] + [None] * (longest - len(row['column1'])),
    axis=1, result_type='expand')

# melt as before, then remove the rows that only hold padding
pd.melt(
    df.drop(columns=['column1']).join(expanded),
    id_vars=['column2'], value_vars=arange(longest), value_name='column1'
    ).dropna(subset=['column1'])[['column1','column2']]
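For comparison, on pandas >= 0.25.0 the explode method covered at the top handles varying lengths without any padding. A short sketch using the frame from the update; the variable name result is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# explode emits one row per list element, so a shorter list simply
# produces fewer rows and nothing has to be padded or filtered
result = df.explode('column1').reset_index(drop=True)[['column1', 'column2']]

print(result)
```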