pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

python pandas numpy apply

OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:

def label_race (row):   if row['eri_hispanic'] == 1 :      return 'Hispanic'   if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :      return 'Two Or More'   if row['eri_nat_amer'] == 1 :      return 'A/I AK Native'   if row['eri_asian'] == 1:      return 'Asian'   if row['eri_afr_amer']  == 1:      return 'Black/AA'   if row['eri_hawaiian'] == 1:      return 'Haw/Pac Isl.'   if row['eri_white'] == 1:      return 'White'   return 'Other'

You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".

Next, use the apply function in pandas to apply the function - e.g.

df.apply (lambda row: label_race(row), axis=1)

Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:

0           White1        Hispanic2           White3           White4           Other5           White6     Two Or More7           White8    Haw/Pac Isl.9           White

If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.

df['race_label'] = df.apply (lambda row: label_race(row), axis=1)

The resultant dataframe looks like this (scroll to the right to see the new column):

      lname   fname rno_cd  eri_afr_amer  eri_asian  eri_hawaiian   eri_hispanic  eri_nat_amer  eri_white rno_defined    race_label0      MOST    JEFF      E             0          0             0              0             0          1       White         White1    CRUISE     TOM      E             0          0             0              1             0          0       White      Hispanic2      DEPP  JOHNNY    NaN             0          0             0              0             0          1     Unknown         White3     DICAP     LEO    NaN             0          0             0              0             0          1     Unknown         White4    BRANDO  MARLON      E             0          0             0              0             0          0       White         Other5     HANKS     TOM    NaN             0          0             0              0             0          1     Unknown         White6    DENIRO  ROBERT      E             0          1             0              0             0          1       White   Two Or More7    PACINO      AL      E             0          0             0              0             0          1       White         White8  WILLIAMS   ROBIN      E             0          0             1              0             0          0       White  Haw/Pac Isl.9  EASTWOOD   CLINT      E             0          0             0              0             0          1       White         White

python pandas numpy apply

Since this is the first Google result for 'pandas new column from others', here's a simple example:

import pandas as pd# make a simple dataframedf = pd.DataFrame({'a':[1,2], 'b':[3,4]})df#    a  b# 0  1  3# 1  2  4# create an unattached column with an indexdf.apply(lambda row: row.a + row.b, axis=1)# 0    4# 1    6# do same but attach it to the dataframedf['c'] = df.apply(lambda row: row.a + row.b, axis=1)df#    a  b  c# 0  1  3  4# 1  2  4  6

If you get the SettingWithCopyWarning you can do it this way also:

fn = lambda row: row.a + row.b # define a function for the new columncol = df.apply(fn, axis=1) # get column data with an indexdf = df.assign(c=col.values) # assign values to column 'c'

Source: https://stackoverflow.com/a/12555510/243392

And if your column name includes spaces you can use syntax like this:

df = df.assign(**{'some column name': col.values})

And here's the documentation for apply, and assign.

python pandas numpy apply

The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:

First, define conditions:

conditions = [    df['eri_hispanic'] == 1,    df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),    df['eri_nat_amer'] == 1,    df['eri_asian'] == 1,    df['eri_afr_amer'] == 1,    df['eri_hawaiian'] == 1,    df['eri_white'] == 1,]

Now, define the corresponding outputs:

outputs = [    'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']

Finally, using numpy.select:

res = np.select(conditions, outputs, 'Other')pd.Series(res)

0           White1        Hispanic2           White3           White4           Other5           White6     Two Or More7           White8    Haw/Pac Isl.9           Whitedtype: object

Why should numpy.select be used over apply? Here are some performance checks:

df = pd.concat([df]*1000)In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)In [44]: %%timeit    ...: conditions = [    ...:     df['eri_hispanic'] == 1,    ...:     df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),    ...:     df['eri_nat_amer'] == 1,    ...:     df['eri_asian'] == 1,    ...:     df['eri_afr_amer'] == 1,    ...:     df['eri_hawaiian'] == 1,    ...:     df['eri_white'] == 1,    ...: ]    ...:    ...: outputs = [    ...:     'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'    ...: ]    ...:    ...: np.select(conditions, outputs, 'Other')    ...:    ...:3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.

CodeHunter

pandas create new column based on values from other columns / apply a function of multiple columns, row-wise

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last