How to split a dataframe string column into two columns? How to split a dataframe string column into two columns? python python

How to split a dataframe string column into two columns?


TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df[['A', 'B']] = df['AB'].str.split(' ', 1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().

In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like, splitting by dashes, or splitting by whitespace), the .str.split() method is enough1. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})>>> df      AB0  A1-B11  A2-B2>>> df['AB_split'] = df['AB'].str.split('-')>>> df      AB  AB_split0  A1-B1  [A1, B1]1  A2-B2  [A2, B2]

1: If you're unsure what the first two parameters of .str.split() do,I recommend the docs for the plain Python version of the method.

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method in each element as efficient as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})>>> upper_lower_df   U0  A1  B2  C>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()>>> upper_lower_df   U  L0  A  a1  B  b2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

>>> df['AB'].str[0]0    A1    AName: AB, dtype: object>>> df['AB'].str[1]0    11    2Name: AB, dtype: object

Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

>>> df['AB'].str.split('-', 1).str[0]0    A11    A2Name: AB, dtype: object>>> df['AB'].str.split('-', 1).str[1]0    B11    B2Name: AB, dtype: object

Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do

>>> df['A'], df['B'] = df['AB'].str.split('-', 1).str>>> df      AB  AB_split   A   B0  A1-B1  [A1, B1]  A1  B11  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df['AB'].str.split('-', 1, expand=True)    0   10  A1  B11  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[['AB']]>>> df      AB0  A1-B11  A2-B2>>> df.join(df['AB'].str.split('-', 1, expand=True).rename(columns={0:'A', 1:'B'}))      AB   A   B0  A1-B1  A1  B11  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})>>> df         AB0     A1-B11     A2-B22  A3-B3-C3>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')Traceback (most recent call last):  [...]    ValueError: Length of values does not match length of index>>> 

But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

>>> df.join(...     df['AB'].str.split('-', expand=True).rename(...         columns={0:'A', 1:'B', 2:'C'}...     )... )         AB   A   B     C0     A1-B1  A1  B1  None1     A2-B2  A2  B2  None2  A3-B3-C3  A3  B3    C3


There might be a better way, but this here's one approach:

                            row    0       00000 UNITED STATES    1             01000 ALABAMA    2  01001 Autauga County, AL    3  01003 Baldwin County, AL    4  01005 Barbour County, AL
df = pd.DataFrame(df.row.str.split(' ',1).tolist(),                                 columns = ['fips','row'])
   fips                 row0  00000       UNITED STATES1  01000             ALABAMA2  01001  Autauga County, AL3  01003  Baldwin County, AL4  01005  Barbour County, AL


You can extract the different parts out quite neatly using a regex pattern:

In [11]: df.row.str.extract('(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')Out[11]:     fips                    1           state           county state_code0  00000        UNITED STATES   UNITED STATES              NaN        NaN1  01000              ALABAMA         ALABAMA              NaN        NaN2  01001   Autauga County, AL             NaN   Autauga County         AL3  01003   Baldwin County, AL             NaN   Baldwin County         AL4  01005   Barbour County, AL             NaN   Barbour County         AL[5 rows x 5 columns]

To explain the somewhat long regex:

(?P<fips>\d{5})
  • Matches the five digits (\d) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

Does either (|) one of two things:

(?P<state>[A-Z ]*$)
  • Matches any number (*) of capital letters or spaces ([A-Z ]) and names this "state" before the end of the string ($),

or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
  • matches anything else (.*) then
  • a comma and a space then
  • matches the two digit state_code before the end of the string ($).

In the example:
Note that the first two rows hit the "state" (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).