Split (explode) pandas dataframe string entry to separate rows Split (explode) pandas dataframe string entry to separate rows python python

Split (explode) pandas dataframe string entry to separate rows


UPDATE3: it makes more sense to use implemented in Pandas 0.25.0Series.explode() / DataFrame.explode() methods (extended in Pandas 1.3.0 to support multi-column explode) as it shown in the usage example:

for a single column:

In [1]: df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],   ...:                    'B': 1,   ...:                    'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})In [2]: dfOut[2]:           A  B          C0  [0, 1, 2]  1  [a, b, c]1        foo  1        NaN2         []  1         []3     [3, 4]  1     [d, e]In [3]: df.explode('A')Out[3]:     A  B          C0    0  1  [a, b, c]0    1  1  [a, b, c]0    2  1  [a, b, c]1  foo  1        NaN2  NaN  1         []3    3  1     [d, e]3    4  1     [d, e]

for multiple columns (for Pandas 1.3.0+):

In [4]: df.explode(list('AC'))Out[4]:     A  B    C0    0  1    a0    1  1    b0    2  1    c1  foo  1  NaN2  NaN  1  NaN3    3  1    d3    4  1    e

UPDATE2: more generic vectorized function, which will work for multiple normal and multiple list columns

def explode(df, lst_cols, fill_value='', preserve_index=False):    # make sure `lst_cols` is list-alike    if (lst_cols is not None        and len(lst_cols) > 0        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):        lst_cols = [lst_cols]    # all columns except `lst_cols`    idx_cols = df.columns.difference(lst_cols)    # calculate lengths of lists    lens = df[lst_cols[0]].str.len()    # preserve original index values        idx = np.repeat(df.index.values, lens)    # create "exploded" DF    res = (pd.DataFrame({                col:np.repeat(df[col].values, lens)                for col in idx_cols},                index=idx)             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)                            for col in lst_cols}))    # append those rows that have empty lists    if (lens == 0).any():        # at least one list in cells is empty        res = (res.append(df.loc[lens==0, idx_cols], sort=False)                  .fillna(fill_value))    # revert the original index order    res = res.sort_index()    # reset index if requested    if not preserve_index:                res = res.reset_index(drop=True)    return res

Demo:

Multiple list columns - all list columns must have the same # of elements in each row:

In [134]: dfOut[134]:   aaa  myid        num          text0   10     1  [1, 2, 3]  [aa, bb, cc]1   11     2         []            []2   12     3     [1, 2]      [cc, dd]3   13     4         []            []In [135]: explode(df, ['num','text'], fill_value='')Out[135]:   aaa  myid num text0   10     1   1   aa1   10     1   2   bb2   10     1   3   cc3   11     24   12     3   1   cc5   12     3   2   dd6   13     4

preserving original index values:

In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)Out[136]:   aaa  myid num text0   10     1   1   aa0   10     1   2   bb0   10     1   3   cc1   11     22   12     3   1   cc2   12     3   2   dd3   13     4

Setup:

df = pd.DataFrame({ 'aaa': {0: 10, 1: 11, 2: 12, 3: 13}, 'myid': {0: 1, 1: 2, 2: 3, 3: 4}, 'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []}, 'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}})

CSV column:

In [46]: dfOut[46]:        var1  var2 var30      a,b,c     1   XX1  d,e,f,x,y     2   ZZIn [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')Out[47]:  var1  var2 var30    a     1   XX1    b     1   XX2    c     1   XX3    d     2   ZZ4    e     2   ZZ5    f     2   ZZ6    x     2   ZZ7    y     2   ZZ

using this little trick we can convert CSV-like column to list column:

In [48]: df.assign(var1=df.var1.str.split(','))Out[48]:              var1  var2 var30        [a, b, c]     1   XX1  [d, e, f, x, y]     2   ZZ

UPDATE: generic vectorized approach (will work also for multiple columns):

Original DF:

In [177]: dfOut[177]:        var1  var2 var30      a,b,c     1   XX1  d,e,f,x,y     2   ZZ

Solution:

first let's convert CSV strings to lists:

In [178]: lst_col = 'var1' In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})In [180]: xOut[180]:              var1  var2 var30        [a, b, c]     1   XX1  [d, e, f, x, y]     2   ZZ

Now we can do this:

In [181]: pd.DataFrame({     ...:     col:np.repeat(x[col].values, x[lst_col].str.len())     ...:     for col in x.columns.difference([lst_col])     ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]     ...:Out[181]:  var1  var2 var30    a     1   XX1    b     1   XX2    c     1   XX3    d     2   ZZ4    e     2   ZZ5    f     2   ZZ6    x     2   ZZ7    y     2   ZZ

OLD answer:

Inspired by @AFinkelstein solution, i wanted to make it bit more generalized which could be applied to DF with more than two columns and as fast, well almost, as fast as AFinkelstein's solution):

In [2]: df = pd.DataFrame(   ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},   ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]   ...: )In [3]: dfOut[3]:        var1  var2 var30      a,b,c     1   XX1  d,e,f,x,y     2   ZZIn [4]: (df.set_index(df.columns.drop('var1',1).tolist())   ...:    .var1.str.split(',', expand=True)   ...:    .stack()   ...:    .reset_index()   ...:    .rename(columns={0:'var1'})   ...:    .loc[:, df.columns]   ...: )Out[4]:  var1  var2 var30    a     1   XX1    b     1   XX2    c     1   XX3    d     2   ZZ4    e     2   ZZ5    f     2   ZZ6    x     2   ZZ7    y     2   ZZ


After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.

If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0b.columns = ['var1', 'var2'] # renaming var1


Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

Since you have a list of comma separated strings, split the string on comma to get a list of elements, then call explode on that column.

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})df    var1  var20  a,b,c     11  d,e,f     2df.assign(var1=df['var1'].str.split(',')).explode('var1')  var1  var20    a     10    b     10    c     11    d     21    e     21    f     2

Note that explode only works on a single column (for now). To explode multiple columns at once, see below.

NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.

df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})df    var1  var20  d,e,f     11            22    NaN     3df['var1'].str.split(',')0    [d, e, f]1           []2          NaNdf.assign(var1=df['var1'].str.split(',')).explode('var1')  var1  var20    d     10    e     10    f     11          2  # empty list entry becomes empty string after exploding 2  NaN     3  # NaN left un-touched

This is a serious advantage over ravel/repeat -based solutions (which ignore empty lists completely, and choke on NaNs).


Exploding Multiple Columns

Note that explode only works on a single column at a time, but you can use apply to explode multiple column at once:

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'],                    'var2': ['i,j,k', 'l,m,n'],                    'var3': [1, 2]})df    var1   var2  var30  a,b,c  i,j,k     11  d,e,f  l,m,n     2(df.set_index(['var3'])    .apply(lambda col: col.str.split(',').explode())   .reset_index()   .reindex(df.columns, axis=1))df  var1 var2  var30    a    i     11    b    j     12    c    k     13    d    l     24    e    m     25    f    n     2

The idea is to set as the index, all the columns that should NOT be exploded, then explode the remaining columns via apply. This works well when the lists are equally sized.