How can I split a column of tuples in a Pandas dataframe?

You can do this by calling pd.DataFrame(col.tolist()) on that column:

    In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})

    In [3]: df
    Out[3]:
       a       b
    0  1  (1, 2)
    1  2  (3, 4)

    In [4]: df['b'].tolist()
    Out[4]: [(1, 2), (3, 4)]

    In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
    Out[5]:
       0  1
    0  1  2
    1  3  4

    In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)

    In [7]: df
    Out[7]:
       a       b  b1  b2
    0  1  (1, 2)   1   2
    1  2  (3, 4)   3   4

Note: an earlier version of this answer recommended using df['b'].apply(pd.Series) instead of pd.DataFrame(df['b'].tolist(), index=df.index). That works as well (because it makes a Series of each tuple, which is then seen as a row of a dataframe), but it is slower and uses more memory than the tolist version, as noted by the other answers here (thanks to denfromufa).
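The index=df.index argument matters whenever the frame does not have the default 0..n-1 index, because column assignment aligns on index labels. The following is a minimal sketch of that effect; the frame and its 'x'/'y' labels are made up for illustration and are not from the answer above.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame({'a': [1, 2], 'b': [(1, 2), (3, 4)]}, index=['x', 'y'])

# With index=df.index the new columns line up with the original rows.
aligned = df.copy()
aligned[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)

# Without it the helper frame is indexed 0, 1; none of those labels match
# 'x' or 'y', so the assigned columns end up as NaN.
misaligned = df.copy()
misaligned[['b1', 'b2']] = pd.DataFrame(df['b'].tolist())

print(aligned)
print(misaligned)
```

With a default RangeIndex both forms give the same result, which is why the omission often goes unnoticed.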

python numpy pandas dataframe tuples

The str accessor that is available to pandas.Series objects of dtype == object is actually an iterable.

Assume a pandas.DataFrame df:

    df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))
    df

            col
    0   (a, 10)
    1   (b, 20)
    2   (c, 30)
    3   (d, 40)
    4   (e, 50)
    5   (f, 60)
    6   (g, 70)
    7   (h, 80)
    8   (i, 90)
    9  (j, 100)

We can test if it is an iterable:

    from collections.abc import Iterable
    isinstance(df.col.str, Iterable)

    True

We can then assign from it like we do other iterables:

    var0, var1 = 'xy'
    print(var0, var1)

    x y

Simplest solution

So in one line we can assign both columns:

    df['a'], df['b'] = df.col.str
    df

            col  a    b
    0   (a, 10)  a   10
    1   (b, 20)  b   20
    2   (c, 30)  c   30
    3   (d, 40)  d   40
    4   (e, 50)  e   50
    5   (f, 60)  f   60
    6   (g, 70)  g   70
    7   (h, 80)  h   80
    8   (i, 90)  i   90
    9  (j, 100)  j  100
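Whether the str accessor can be iterated directly depends on the pandas version (some releases warn about it, and newer ones may reject it). As a hedged alternative that does not rely on iteration, the sketch below pulls each tuple position out explicitly with .str.get(i), which is equivalent to .str[i]. It rebuilds the same df as above so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(dict(col=[*zip('abcdefghij', range(10, 101, 10))]))

# .str.get(i) returns the i-th element of every tuple in the column.
df['a'] = df.col.str.get(0)
df['b'] = df.col.str.get(1)

print(df)
```

This costs one line per column, but it makes the tuple positions explicit.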

Faster solution

Only slightly more complicated, we can use zip to create a similar iterable:

    df['c'], df['d'] = zip(*df.col)
    df

            col  a    b  c    d
    0   (a, 10)  a   10  a   10
    1   (b, 20)  b   20  b   20
    2   (c, 30)  c   30  c   30
    3   (d, 40)  d   40  d   40
    4   (e, 50)  e   50  e   50
    5   (f, 60)  f   60  f   60
    6   (g, 70)  g   70  g   70
    7   (h, 80)  h   80  h   80
    8   (i, 90)  i   90  i   90
    9  (j, 100)  j  100  j  100
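For readers unfamiliar with the zip(*...) idiom, here is a plain-Python sketch of what it does to a sequence of pairs. The pairs list is an illustrative stand-in for df.col.

```python
# A list of 2-tuples, standing in for the values of df.col.
pairs = [('a', 10), ('b', 20), ('c', 30)]

# The star unpacks the list into separate arguments, so zip receives
# ('a', 10), ('b', 20), ('c', 30) and regroups them by position.
firsts, seconds = zip(*pairs)

print(firsts)
print(seconds)
```

Each resulting tuple has one entry per row, which is why it can be assigned straight to a new column.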

Inline

That is, return a new DataFrame instead of mutating the existing df.

The approach below works because assign takes keyword arguments, where the keywords are the new (or existing) column names and the values are the contents of those columns. You can build a dictionary and unpack it with ** so that it acts as the keyword arguments.

So this is a clever way of assigning a new column named 'g' that is the first item in the df.col.str iterable and 'h' that is the second item in the df.col.str iterable:

    df.assign(**dict(zip('gh', df.col.str)))

            col  g    h
    0   (a, 10)  a   10
    1   (b, 20)  b   20
    2   (c, 30)  c   30
    3   (d, 40)  d   40
    4   (e, 50)  e   50
    5   (f, 60)  f   60
    6   (g, 70)  g   70
    7   (h, 80)  h   80
    8   (i, 90)  i   90
    9  (j, 100)  j  100
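To make the ** unpacking concrete, the sketch below builds the dictionary in a separate step and compares the unpacked call with the spelled-out keyword form. It uses a shortened three-row frame and zip(*df.col) rather than the str accessor; the names new_cols, unpacked and explicit are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(col=[*zip('abc', (10, 20, 30))]))

# Map each new column name to the values at that tuple position.
new_cols = {name: list(vals) for name, vals in zip('gh', zip(*df.col))}

# These two calls are equivalent; ** turns the dict into keyword arguments.
unpacked = df.assign(**new_cols)
explicit = df.assign(g=new_cols['g'], h=new_cols['h'])

print(unpacked)
print(df.columns.tolist())  # the original frame is left untouched
```

Because assign returns a new frame, the result can be chained or discarded without affecting df.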

My version of the `list` approach

With modern list syntax and iterable unpacking. Note: this is also inline, using join.

    df.join(pd.DataFrame([*df.col], df.index, [*'ef']))

            col  e    f
    0   (a, 10)  a   10
    1   (b, 20)  b   20
    2   (c, 30)  c   30
    3   (d, 40)  d   40
    4   (e, 50)  e   50
    5   (f, 60)  f   60
    6   (g, 70)  g   70
    7   (h, 80)  h   80
    8   (i, 90)  i   90
    9  (j, 100)  j  100

The mutating version would be

df[['e', 'f']] = pd.DataFrame([*df.col], df.index)

Naive Time Test

Short DataFrame

Use the one defined above:

    %timeit df.assign(**dict(zip('gh', df.col.str)))
    %timeit df.assign(**dict(zip('gh', zip(*df.col))))
    %timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))

    1.16 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    635 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    795 µs ± 42.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Long DataFrame

10^3 times bigger

    df = pd.concat([df] * 1000, ignore_index=True)

    %timeit df.assign(**dict(zip('gh', df.col.str)))
    %timeit df.assign(**dict(zip('gh', zip(*df.col))))
    %timeit df.join(pd.DataFrame([*df.col], df.index, [*'gh']))

    11.4 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.1 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.33 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


On much larger datasets, I found that .apply() is a few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index).

This performance issue was closed on GitHub, although I do not agree with this decision:

performance issue - apply with pd.Series vs tuple #11615

It is based on this answer.
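To check the difference on your own machine, a rough comparison could look like the sketch below. The row count, the repeat count, and the helper names with_apply and with_tolist are arbitrary choices made for this example, and the timings will vary with hardware and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical test data: 2,000 rows of 2-tuples (size chosen arbitrarily).
df = pd.DataFrame({'b': [(i, i + 1) for i in range(2000)]})

def with_apply():
    # Builds one Series per row, then stacks them into a DataFrame.
    return df['b'].apply(pd.Series)

def with_tolist():
    # Converts the column to a list of tuples and builds the frame in one go.
    return pd.DataFrame(df['b'].values.tolist(), index=df.index)

# Both approaches should yield the same values.
same_values = bool((with_apply().values == with_tolist().values).all())

t_apply = timeit.timeit(with_apply, number=3)
t_tolist = timeit.timeit(with_tolist, number=3)
print(f"same values: {same_values}")
print(f"apply(pd.Series): {t_apply:.4f} s, tolist: {t_tolist:.4f} s")
```

Increasing the row count should make any gap between the two approaches easier to see.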
