Using conditional to generate new column in pandas dataframe


You can define a function that returns the different states ("Full", "Partial", "Empty", etc.) and then use df.apply to apply it to each row. Note that you must pass the keyword argument axis=1 so that the function is applied to rows rather than columns.

    import pandas as pd

    def alert(row):
        if row['used'] == 1.0:
            return 'Full'
        elif row['used'] == 0.0:
            return 'Empty'
        elif 0.0 < row['used'] < 1.0:
            return 'Partial'
        else:
            return 'Undefined'

    df = pd.DataFrame(data={'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

    df['alert'] = df.apply(alert, axis=1)

    #    portion  used    alert
    # 0        1   1.0     Full
    # 1        2   0.3  Partial
    # 2        3   0.0    Empty
    # 3        4   0.8  Partial
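To see what axis=1 changes, here is a minimal sketch. The helper show_labels and the two-row frame are made up for illustration: the helper just reports the index labels of whatever Series pandas passes in. With the default axis=0 the function receives one column at a time, so a lookup like row['used'] would fail with a KeyError. With axis=1 it receives one row at a time, indexed by column name.

```python
import pandas as pd

# Small illustrative frame (hypothetical data).
df = pd.DataFrame({'portion': [1, 2], 'used': [1.0, 0.3]})

def show_labels(s):
    # Join the index labels of the Series that pandas passes in.
    return ','.join(str(label) for label in s.index)

# Default axis=0: one call per column; each Series is indexed by row label.
by_column = df.apply(show_labels)

# axis=1: one call per row; each Series is indexed by column name.
by_row = df.apply(show_labels, axis=1)

print(by_column)
print(by_row)
```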


Alternatively you could do:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(data={'portion': np.arange(10000), 'used': np.random.rand(10000)})

    %%timeit
    df.loc[df['used'] == 1.0, 'alert'] = 'Full'
    df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
    df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'

For values between 0 and 1 this gives the same output, but it runs about 100 times faster on 10,000 rows:

    100 loops, best of 3: 2.91 ms per loop

For comparison, using apply:

    %timeit df['alert'] = df.apply(alert, axis=1)
    1 loops, best of 3: 287 ms per loop

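Which one to choose depends on how big your dataframe is: on small frames the difference is negligible, while on large ones the vectorized version is noticeably faster.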
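If you are not working in IPython, the same comparison can be sketched with the standard timeit module. The wrapper names label_with_apply and label_with_loc are made up for this example, and pre-filling the column with 'Undefined' is an added assumption so that out-of-range values match the else branch of alert. Absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def alert(row):
    if row['used'] == 1.0:
        return 'Full'
    elif row['used'] == 0.0:
        return 'Empty'
    elif 0.0 < row['used'] < 1.0:
        return 'Partial'
    else:
        return 'Undefined'

def label_with_apply(df):
    # Row-wise approach: one Python function call per row.
    return df.apply(alert, axis=1)

def label_with_loc(df):
    # Vectorized approach: boolean masks assigned with .loc on a copy.
    out = df.copy()
    out['alert'] = 'Undefined'  # assumption: default label for everything else
    out.loc[out['used'] == 1.0, 'alert'] = 'Full'
    out.loc[out['used'] == 0.0, 'alert'] = 'Empty'
    out.loc[(out['used'] > 0.0) & (out['used'] < 1.0), 'alert'] = 'Partial'
    return out['alert']

n_rows = 10_000
df = pd.DataFrame({'portion': np.arange(n_rows), 'used': np.random.rand(n_rows)})

for func in (label_with_apply, label_with_loc):
    # Average over a few runs; raise `number` for steadier figures.
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__}: {seconds:.4f} s per call")
```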


Use np.where, which is usually fast:

    In [845]: df['alert'] = np.where(df.used == 1, 'Full',
                                     np.where(df.used == 0, 'Empty', 'Partial'))

    In [846]: df
    Out[846]:
       portion  used    alert
    0        1   1.0     Full
    1        2   0.3  Partial
    2        3   0.0    Empty
    3        4   0.8  Partial

Timings

    In [848]: df.shape
    Out[848]: (100000, 3)

    In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
    100 loops, best of 3: 6.17 ms per loop

    In [850]: %%timeit
         ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
         ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
         ...: df.loc[(df['used'] > 0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
         ...:
    10 loops, best of 3: 21.9 ms per loop

    In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
    1 loop, best of 3: 2.79 s per loop
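One caveat with the nested np.where is that every value other than 0 and 1 is labelled 'Partial', including values outside the 0 to 1 range. If you want the 'Undefined' fallback from the alert function, np.select with a default is one possible way to write it. This is a sketch rather than part of the answers above, and the extra row with used = 1.5 is made up to exercise the fallback.

```python
import numpy as np
import pandas as pd

# Same sample data as before, plus one out-of-range value (hypothetical).
df = pd.DataFrame({'portion': [1, 2, 3, 4, 5],
                   'used': [1.0, 0.3, 0.0, 0.8, 1.5]})

# Conditions are checked in order; the first match determines the label.
conditions = [
    df['used'] == 1.0,
    df['used'] == 0.0,
    (df['used'] > 0.0) & (df['used'] < 1.0),
]
choices = ['Full', 'Empty', 'Partial']

# Rows matching none of the conditions receive the default label.
df['alert'] = np.select(conditions, choices, default='Undefined')

print(df)
```

Like np.where, this operates on whole columns at once, so it avoids the per-row Python calls that make apply slow.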