Split (explode) pandas dataframe string entry to separate rows
UPDATE3: it makes more sense to use implemented in Pandas 0.25.0Series.explode()
/ DataFrame.explode()
methods (extended in Pandas 1.3.0 to support multi-column explode) as it shown in the usage example:
for a single column:
In [1]: df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]], ...: 'B': 1, ...: 'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})In [2]: dfOut[2]: A B C0 [0, 1, 2] 1 [a, b, c]1 foo 1 NaN2 [] 1 []3 [3, 4] 1 [d, e]In [3]: df.explode('A')Out[3]: A B C0 0 1 [a, b, c]0 1 1 [a, b, c]0 2 1 [a, b, c]1 foo 1 NaN2 NaN 1 []3 3 1 [d, e]3 4 1 [d, e]
for multiple columns (for Pandas 1.3.0+):
In [4]: df.explode(list('AC'))Out[4]: A B C0 0 1 a0 1 1 b0 2 1 c1 foo 1 NaN2 NaN 1 NaN3 3 1 d3 4 1 e
UPDATE2: more generic vectorized function, which will work for multiple normal
and multiple list
columns
def explode(df, lst_cols, fill_value='', preserve_index=False): # make sure `lst_cols` is list-alike if (lst_cols is not None and len(lst_cols) > 0 and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))): lst_cols = [lst_cols] # all columns except `lst_cols` idx_cols = df.columns.difference(lst_cols) # calculate lengths of lists lens = df[lst_cols[0]].str.len() # preserve original index values idx = np.repeat(df.index.values, lens) # create "exploded" DF res = (pd.DataFrame({ col:np.repeat(df[col].values, lens) for col in idx_cols}, index=idx) .assign(**{col:np.concatenate(df.loc[lens>0, col].values) for col in lst_cols})) # append those rows that have empty lists if (lens == 0).any(): # at least one list in cells is empty res = (res.append(df.loc[lens==0, idx_cols], sort=False) .fillna(fill_value)) # revert the original index order res = res.sort_index() # reset index if requested if not preserve_index: res = res.reset_index(drop=True) return res
Demo:
Multiple list
columns - all list
columns must have the same # of elements in each row:
In [134]: dfOut[134]: aaa myid num text0 10 1 [1, 2, 3] [aa, bb, cc]1 11 2 [] []2 12 3 [1, 2] [cc, dd]3 13 4 [] []In [135]: explode(df, ['num','text'], fill_value='')Out[135]: aaa myid num text0 10 1 1 aa1 10 1 2 bb2 10 1 3 cc3 11 24 12 3 1 cc5 12 3 2 dd6 13 4
preserving original index values:
In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)Out[136]: aaa myid num text0 10 1 1 aa0 10 1 2 bb0 10 1 3 cc1 11 22 12 3 1 cc2 12 3 2 dd3 13 4
Setup:
df = pd.DataFrame({ 'aaa': {0: 10, 1: 11, 2: 12, 3: 13}, 'myid': {0: 1, 1: 2, 2: 3, 3: 4}, 'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []}, 'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}})
CSV column:
In [46]: dfOut[46]: var1 var2 var30 a,b,c 1 XX1 d,e,f,x,y 2 ZZIn [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')Out[47]: var1 var2 var30 a 1 XX1 b 1 XX2 c 1 XX3 d 2 ZZ4 e 2 ZZ5 f 2 ZZ6 x 2 ZZ7 y 2 ZZ
using this little trick we can convert CSV-like column to list
column:
In [48]: df.assign(var1=df.var1.str.split(','))Out[48]: var1 var2 var30 [a, b, c] 1 XX1 [d, e, f, x, y] 2 ZZ
UPDATE: generic vectorized approach (will work also for multiple columns):
Original DF:
In [177]: dfOut[177]: var1 var2 var30 a,b,c 1 XX1 d,e,f,x,y 2 ZZ
Solution:
first let's convert CSV strings to lists:
In [178]: lst_col = 'var1' In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})In [180]: xOut[180]: var1 var2 var30 [a, b, c] 1 XX1 [d, e, f, x, y] 2 ZZ
Now we can do this:
In [181]: pd.DataFrame({ ...: col:np.repeat(x[col].values, x[lst_col].str.len()) ...: for col in x.columns.difference([lst_col]) ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()] ...:Out[181]: var1 var2 var30 a 1 XX1 b 1 XX2 c 1 XX3 d 2 ZZ4 e 2 ZZ5 f 2 ZZ6 x 2 ZZ7 y 2 ZZ
OLD answer:
Inspired by @AFinkelstein solution, i wanted to make it bit more generalized which could be applied to DF with more than two columns and as fast, well almost, as fast as AFinkelstein's solution):
In [2]: df = pd.DataFrame( ...: [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'}, ...: {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}] ...: )In [3]: dfOut[3]: var1 var2 var30 a,b,c 1 XX1 d,e,f,x,y 2 ZZIn [4]: (df.set_index(df.columns.drop('var1',1).tolist()) ...: .var1.str.split(',', expand=True) ...: .stack() ...: .reset_index() ...: .rename(columns={0:'var1'}) ...: .loc[:, df.columns] ...: )Out[4]: var1 var2 var30 a 1 XX1 b 1 XX2 c 1 XX3 d 2 ZZ4 e 2 ZZ5 f 2 ZZ6 x 2 ZZ7 y 2 ZZ
After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.
If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.
b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0b.columns = ['var1', 'var2'] # renaming var1
Pandas >= 0.25
Series and DataFrame methods define a .explode()
method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
Since you have a list of comma separated strings, split the string on comma to get a list of elements, then call explode
on that column.
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})df var1 var20 a,b,c 11 d,e,f 2df.assign(var1=df['var1'].str.split(',')).explode('var1') var1 var20 a 10 b 10 c 11 d 21 e 21 f 2
Note that explode
only works on a single column (for now). To explode multiple columns at once, see below.
NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.
df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})df var1 var20 d,e,f 11 22 NaN 3df['var1'].str.split(',')0 [d, e, f]1 []2 NaNdf.assign(var1=df['var1'].str.split(',')).explode('var1') var1 var20 d 10 e 10 f 11 2 # empty list entry becomes empty string after exploding 2 NaN 3 # NaN left un-touched
This is a serious advantage over ravel
/repeat
-based solutions (which ignore empty lists completely, and choke on NaNs).
Exploding Multiple Columns
Note that explode
only works on a single column at a time, but you can use apply
to explode multiple column at once:
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': ['i,j,k', 'l,m,n'], 'var3': [1, 2]})df var1 var2 var30 a,b,c i,j,k 11 d,e,f l,m,n 2(df.set_index(['var3']) .apply(lambda col: col.str.split(',').explode()) .reset_index() .reindex(df.columns, axis=1))df var1 var2 var30 a i 11 b j 12 c k 13 d l 24 e m 25 f n 2
The idea is to set as the index, all the columns that should NOT be exploded, then explode the remaining columns via apply
. This works well when the lists are equally sized.