pandas create new column based on values from other columns / apply a function of multiple columns, row-wise pandas create new column based on values from other columns / apply a function of multiple columns, row-wise python python

pandas create new column based on values from other columns / apply a function of multiple columns, row-wise


OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:

def label_race (row):   if row['eri_hispanic'] == 1 :      return 'Hispanic'   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :      return 'Two Or More'   if row['eri_nat_amer'] == 1 :      return 'A/I AK Native'   if row['eri_asian'] == 1:      return 'Asian'   if row['eri_afr_amer']  == 1:      return 'Black/AA'   if row['eri_hawaiian'] == 1:      return 'Haw/Pac Isl.'   if row['eri_white'] == 1:      return 'White'   return 'Other'

You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

Next, use the apply function in pandas to apply the function - e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White1        Hispanic2           White3           White4           Other5           White6     Two Or More7           White8    Haw/Pac Isl.9           White

If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label0      MOST    JEFF      E             0          0             0              0             0          1       White         White1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More7    PACINO      AL      E             0          0             0              0             0          1       White         White8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White


Since this is the first Google result for 'pandas new column from others', here's a simple example:

import pandas as pd# make a simple dataframedf = pd.DataFrame({'a':[1,2], 'b':[3,4]})df#    a  b# 0  1  3# 1  2  4# create an unattached column with an indexdf.apply(lambda row: row.a + row.b, axis=1)# 0    4# 1    6# do same but attach it to the dataframedf['c'] = df.apply(lambda row: row.a + row.b, axis=1)df#    a  b  c# 0  1  3  4# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new columncol = df.apply(fn, axis=1) # get column data with an indexdf = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here's the documentation for apply, and assign.


The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:


First, define conditions:

conditions = [    df['eri_hispanic'] == 1,    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),    df['eri_nat_amer'] == 1,    df['eri_asian'] == 1,    df['eri_afr_amer'] == 1,    df['eri_hawaiian'] == 1,    df['eri_white'] == 1,]

Now, define the corresponding outputs:

outputs = [    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')pd.Series(res)

0           White1        Hispanic2           White3           White4           Other5           White6     Two Or More7           White8    Haw/Pac Isl.9           Whitedtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)In [44]: %%timeit    ...: conditions = [    ...:     df['eri_hispanic'] == 1,    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),    ...:     df['eri_nat_amer'] == 1,    ...:     df['eri_asian'] == 1,    ...:     df['eri_afr_amer'] == 1,    ...:     df['eri_hawaiian'] == 1,    ...:     df['eri_white'] == 1,    ...: ]    ...:    ...: outputs = [    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'    ...: ]    ...:    ...: np.select(conditions, outputs, 'Other')    ...:    ...:3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.