Pandas: expand rows from list data in a column
DataFrame.explode
Since pandas >= 0.25.0 we have the explode method, which expands a list into one row per element and repeats the rest of the columns:
df.explode('column1').reset_index(drop=True)
Output

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
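A self-contained sketch of the above, assuming the sample frame used elsewhere in this thread (one list-valued column, one scalar column):

```python
import pandas as pd

# sample frame from the question: a list-valued column and a scalar column
df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# explode repeats column2 for every element of column1;
# reset_index(drop=True) replaces the duplicated 0,1,2 index with 0..8
out = df.explode('column1').reset_index(drop=True)
print(out)
```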
Since pandas >= 1.1.0 we have the ignore_index argument, so we don't have to chain with reset_index:
df.explode('column1', ignore_index=True)
Output

  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
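A quick sanity check that the two spellings are equivalent (requires pandas >= 1.1.0 for ignore_index; df is assumed to be the sample frame from the question):

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# ignore_index=True numbers the result 0..n-1 directly,
# so the reset_index(drop=True) chain is no longer needed
a = df.explode('column1', ignore_index=True)
b = df.explode('column1').reset_index(drop=True)
print(a.equals(b))
```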
You can create a DataFrame with the constructor and stack:
df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='column1')[['column1','column2']])
print (df2)
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
Because the subset [['column1','column2']] already fixes the column order and drops the extra index column, you can also omit the first reset_index:
df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(name='column1')[['column1','column2']])
print (df2)
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
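The constructor + stack idea as a runnable sketch, again assuming the sample df from the question:

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# build a wide frame (one column per list position) indexed by column2,
# then stack it back into a long Series and recover both columns
df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='column1')[['column1', 'column2']])
print(df2)
```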
Another solution: use DataFrame.from_records to create a DataFrame from the first column, then build a Series with stack and join it to the original DataFrame:
df = pd.DataFrame({'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
                   'column2':[1,2,3]})

a = (pd.DataFrame.from_records(df.column1.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('column1'))
print (a)
0    a
0    b
0    c
1    d
1    e
1    f
2    g
2    h
2    i
Name: column1, dtype: object

print (df.drop('column1', axis=1)
         .join(a)
         .reset_index(drop=True)[['column1','column2']])
  column1  column2
0       a        1
1       b        1
2       c        1
3       d        2
4       e        2
5       f        2
6       g        3
7       h        3
8       i        3
Another solution is to use the result_type='expand' argument of DataFrame.apply, available since pandas 0.23. Answering @splinter's question, this method can be generalized -- see below:
import pandas as pd
from numpy import arange

df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
     'column2': [1,2,3]})

pd.melt(
    df.join(
        df.apply(lambda row: row['column1'], axis=1, result_type='expand')
    ),
    value_vars=arange(df['column1'].shape[0]),
    value_name='column1',
    var_name='column2')[['column1','column2']]

# can be generalized
df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
     'column2': [1,2,3],
     'column3': [[1,2],[2,3],[3,4]],
     'column4': [42,23,321],
     'column5': ['a','b','c']})

(pd.melt(
    df.join(
        df.apply(lambda row: row['column1'], axis=1, result_type='expand')
    ),
    value_vars=arange(df['column1'].shape[0]),
    value_name='column1',
    id_vars=df.columns[1:])
 .drop(columns=['variable'])[list(df.columns[:1]) + list(df.columns[1:])]
 .sort_values(by=['column1']))
UPDATE (for Jwely's comment): if you have lists with varying lengths, you can do:
df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','f'],['g','h','i']],
     'column2': [1,2,3]})

longest = max(df['column1'].apply(lambda x: len(x)))

pd.melt(
    df.join(
        df.apply(lambda row: row['column1'] if len(row['column1']) >= longest
                 else row['column1'] + [None] * (longest - len(row['column1'])),
                 axis=1, result_type='expand')
    ),
    value_vars=arange(df['column1'].shape[0]),
    value_name='column1',
    var_name='column2'
).query("column1 == column1")[['column1','column2']]
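Worth noting: on pandas >= 0.25.0 the padding-and-filtering dance above is unnecessary, since explode handles lists of different lengths directly. A minimal sketch with the same ragged sample:

```python
import pandas as pd

# same ragged sample: the middle list has only two elements
df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# explode copes with ragged lists out of the box,
# so no None-padding or NaN-filtering query is required
out = df.explode('column1').reset_index(drop=True)
print(out)
```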