Combine two columns of text in pandas dataframe
If both columns are strings, you can concatenate them directly:
df["period"] = df["Year"] + df["quarter"]
If one (or both) of the columns is not string typed, convert it (them) to strings first:
df["period"] = df["Year"].astype(str) + df["quarter"]
### Beware of NaNs when doing this!
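To illustrate the NaN warning, here is a minimal sketch on made-up sample data (the frame and the `fillna("")` workaround are assumptions for illustration, not part of the original answer). Concatenating with `+` propagates a missing value into the result, so you may want to fill it first:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the second quarter value is missing.
df = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", np.nan]})

# Plain concatenation propagates the missing value into the result.
naive = df["Year"].astype(str) + df["quarter"]

# One possible workaround: replace missing values with an empty string first.
filled = df["Year"].astype(str) + df["quarter"].fillna("")

print(naive.isna().tolist())  # which rows ended up missing
print(filled.tolist())
```

Whether an empty string is the right fill value depends on your data; dropping the affected rows is another option.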
If you need to join multiple string columns, you can use agg:
df['period'] = df[['Year', 'quarter', ...]].agg('-'.join, axis=1)
Where "-" is the separator.
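As a runnable sketch of the agg approach: the sample frame and the third column `region` are hypothetical, and the `astype(str)` call is added here because `Year` is numeric in this example (`str.join` only accepts strings).

```python
import pandas as pd

# Hypothetical sample data; "region" is an invented third column.
df = pd.DataFrame(
    {"Year": [2014, 2015], "quarter": ["q1", "q2"], "region": ["EU", "US"]}
)

cols = ["Year", "quarter", "region"]

# Convert every selected column to strings, then join each row with "-".
df["period"] = df[cols].astype(str).agg("-".join, axis=1)

print(df["period"].tolist())  # ['2014-q1-EU', '2015-q2-US']
```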
Small data sets (< 150 rows)
[''.join(i) for i in zip(df["Year"].map(str),df["quarter"])]
or slightly slower but more compact:
df.Year.str.cat(df.quarter)
Larger data sets (> 150 rows)
df['Year'].astype(str) + df['quarter']
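If you want to check where the crossover lies on your own machine, something like the following sketch could be used. The helper names `zip_join` and `vectorized`, the row counts, and the repeat count are arbitrary choices for illustration; actual timings will vary with hardware and pandas version.

```python
import timeit

import pandas as pd


def zip_join(df):
    # List comprehension approach suggested for small frames.
    return ["".join(pair) for pair in zip(df["Year"].map(str), df["quarter"])]


def vectorized(df):
    # Vectorized approach suggested for larger frames.
    return df["Year"].astype(str) + df["quarter"]


base = pd.DataFrame({"Year": [2014, 2015], "quarter": ["q1", "q2"]})

for n_rows in (100, 10_000):
    # Repeat the two-row frame until it has n_rows rows.
    df = pd.concat([base] * (n_rows // 2), ignore_index=True)
    t_zip = timeit.timeit(lambda: zip_join(df), number=20)
    t_vec = timeit.timeit(lambda: vectorized(df), number=20)
    print(f"{n_rows:>6} rows: zip/join {t_zip:.4f}s, astype+add {t_vec:.4f}s")
```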
UPDATE: Timing graph Pandas 0.23.4
Let's test it on a 200K-row DataFrame:
In [250]: df
Out[250]:
   Year quarter
0  2014      q1
1  2015      q2

In [251]: df = pd.concat([df] * 10**5)

In [252]: df.shape
Out[252]: (200000, 2)
UPDATE: new timings using Pandas 0.19.0
Timing without CPU/GPU optimization (sorted from fastest to slowest):
In [107]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 131 ms per loop

In [106]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 161 ms per loop

In [108]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 189 ms per loop

In [109]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 567 ms per loop

In [110]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 584 ms per loop

In [111]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 24.7 s per loop
Timing using CPU/GPU optimization:
In [113]: %timeit df['Year'].astype(str) + df['quarter']
10 loops, best of 3: 53.3 ms per loop

In [114]: %timeit df['Year'].map(str) + df['quarter']
10 loops, best of 3: 65.5 ms per loop

In [115]: %timeit df.Year.str.cat(df.quarter)
10 loops, best of 3: 79.9 ms per loop

In [116]: %timeit df.loc[:, ['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [117]: %timeit df[['Year','quarter']].astype(str).sum(axis=1)
1 loop, best of 3: 230 ms per loop

In [118]: %timeit df[['Year','quarter']].apply(lambda x : '{}{}'.format(x[0],x[1]), axis=1)
1 loop, best of 3: 9.38 s per loop
Answer contribution by @anton-vbr
df = pd.DataFrame({'Year': ['2014', '2015'], 'quarter': ['q1', 'q2']})
df['period'] = df[['Year', 'quarter']].apply(lambda x: ''.join(x), axis=1)
This yields the following dataframe:
   Year quarter  period
0  2014      q1  2014q1
1  2015      q2  2015q2
This method generalizes to an arbitrary number of string columns by replacing df[['Year', 'quarter']] with any column slice of your dataframe, e.g. df.iloc[:,0:2].apply(lambda x: ''.join(x), axis=1).
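For instance, a small sketch with three string columns, where the extra `month` column is a hypothetical addition to the sample frame:

```python
import pandas as pd

# Hypothetical sample data with a third string column, "month".
df = pd.DataFrame(
    {"Year": ["2014", "2015"], "quarter": ["q1", "q2"], "month": ["01", "04"]}
)

# Select the first three columns by position and join each row's values.
# This assumes every selected column already holds strings.
df["period"] = df.iloc[:, 0:3].apply(lambda row: "".join(row), axis=1)

print(df["period"].tolist())  # ['2014q101', '2015q204']
```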
For more information, see the pandas documentation for the apply() method.