
Combine two columns of text in pandas dataframe


If both columns are strings, you can concatenate them directly:

df["period"] = df["Year"] + df["quarter"]

If one (or both) of the columns is not string-typed, convert it (them) first:

df["period"] = df["Year"].astype(str) + df["quarter"]

Beware of NaNs when doing this!
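To make the NaN warning concrete, here is a minimal sketch. The sample data is made up for illustration, and filling with an empty string is only one possible policy:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second row has a missing quarter.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", np.nan]})

# Plain concatenation propagates the missing value into the result.
naive = df["Year"].astype(str) + df["quarter"]

# Filling missing values first keeps the row usable.
filled = df["Year"].astype(str) + df["quarter"].fillna("")

print(naive.isna().tolist())
print(filled.tolist())
```

Whether an empty string, a placeholder, or dropping the row is the right choice depends on what the combined column is used for.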


If you need to join multiple string columns, you can use agg:

df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)

Where "-" is the separator.
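A small sketch of the agg approach on three columns. The "region" column is a hypothetical addition for illustration, and astype(str) is applied first because str.join only accepts strings:

```python
import pandas as pd

# Hypothetical sample data; "region" is not part of the original question.
df = pd.DataFrame({
    "Year": [2014, 2015],
    "quarter": ["q1", "q2"],
    "region": ["EU", "US"],
})

cols = ["Year", "quarter", "region"]

# Convert every selected column to string, then join each row with "-".
df["period"] = df[cols].astype(str).agg("-".join, axis=1)

print(df["period"].tolist())
```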


Small data sets (< 150 rows)

[''.join(i) for i in zip(df["Year"].map(str), df["quarter"])]

or slightly slower but more compact:

df.Year.str.cat(df.quarter)

Larger data sets (> 150 rows)

df['Year'].astype(str) + df['quarter']
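The str.cat variant above assumes Year already holds strings. As a sketch with made-up sample data, str.cat also takes a sep argument for a separator and an na_rep argument to substitute missing values:

```python
import pandas as pd

# Hypothetical sample data: integer years and one missing quarter.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", None]})

# The .str accessor needs strings, so convert Year first.
# sep is placed between the parts; na_rep replaces missing values.
period = df["Year"].astype(str).str.cat(df["quarter"], sep="-", na_rep="?")

print(period.tolist())
```

Without na_rep, a row with a missing value in either input comes out missing in the result.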

UPDATE: Timing graph (Pandas 0.23.4)

[Figure: timing graph]

Let's test it on 200K rows DF:

In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)

UPDATE: new timings using Pandas 0.19.0

Timing without CPU/GPU optimization (sorted from fastest to slowest):

In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop

Timing using CPU/GPU optimization:

In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
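The timings above come from old pandas versions and one particular machine, so it may be worth re-running them locally. Below is a rough sketch using the standard timeit module outside IPython. The frame size (20,000 rows), the repeat counts, and the labels are arbitrary assumptions, and the numbers will differ from those quoted above:

```python
import timeit

import pandas as pd

# Build a test frame the same way as above, but smaller (an arbitrary choice).
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})
df = pd.concat([df] * 10**4, ignore_index=True)

# Three of the candidates from the timings above; labels are illustrative.
candidates = {
    "astype + add": lambda: df["Year"].astype(str) + df["quarter"],
    "map + add": lambda: df["Year"].map(str) + df["quarter"],
    "str.cat": lambda: df["Year"].astype(str).str.cat(df["quarter"]),
}

# Best of 3 repeats, 3 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e3:.1f} ms per call")
```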

Answer contribution by @anton-vbr


import pandas as pd
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)

Yields this dataframe

   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2

This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
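One caveat with this generalization: ''.join raises a TypeError if any selected column holds non-string values. The sketch below wraps the apply approach in a small helper that converts to strings first. The name join_columns and the "month" column are hypothetical, introduced only for illustration:

```python
import pandas as pd


def join_columns(frame, columns, sep=""):
    """Hypothetical helper: join the given columns row-wise into one string Series."""
    # astype(str) guards against non-string columns before joining each row.
    return frame[columns].astype(str).apply(sep.join, axis=1)


# Hypothetical sample data; "month" is an integer column.
df = pd.DataFrame({
    "Year": ["2014", "2015"],
    "quarter": ["q1", "q2"],
    "month": [1, 4],
})

df["period"] = join_columns(df, ["Year", "quarter", "month"], sep="_")

print(df["period"].tolist())
```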

For more information about the apply() method, see the pandas documentation for DataFrame.apply.

