Parallelize pandas apply


For the parallel approach, this is the answer, based on "Parallelize apply after pandas groupby":

from joblib import Parallel, delayed
import multiprocessing
import pandas as pd

# get_nearest_date, holidays and datesFrame are defined elsewhere (not shown here)

def get_nearest_dateParallel(df):
    df['daysBeforeHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
    df['daysAfterHoliday'] = df.myDates.apply(lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
    return df

def applyParallel(dfGrouped, func):
    # run func on each group in a separate worker, then stitch the results back together
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

print('parallel version: ')
# 4 min 30 seconds
%time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)

But I prefer @NinjaPuppy's approach, because it does not require O(n * number_of_holidays) comparisons.
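To make the cost difference concrete, here is a minimal sketch comparing the two styles. The sample dates and the function names (days_to_next_bruteforce, days_to_next_sorted) are made up for illustration and are not from the question. The first version filters the whole holiday list for every row; the second relies on the holidays being sorted and does one binary search per row with searchsorted.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question.
holidays = pd.to_datetime(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30'])
dates = pd.Series(pd.to_datetime(['2016-01-03', '2016-03-03', '2016-02-01']))

def days_to_next_bruteforce(date):
    # Compares the date against every holiday: O(number_of_holidays) per row.
    later = holidays[holidays > date]
    return (later.min() - date).days

def days_to_next_sorted(dates):
    # Binary search in the sorted holidays: O(log number_of_holidays) per row.
    # side='right' matches the strict '>' used in the brute-force version.
    idx = holidays.searchsorted(dates.to_numpy(), side='right')
    diff = holidays[idx].values - dates.to_numpy()
    return np.asarray(pd.to_timedelta(diff).days)

slow = dates.apply(days_to_next_bruteforce).to_numpy()
fast = days_to_next_sorted(dates)
print(slow, fast)
```

Both versions should agree on this sample; the sorted version assumes every date has a later holiday, so it would need a bounds check on real data.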


I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's just start with some dates...

import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

from pandas.tseries.holiday import USFederalHolidayCalendar

holiday_calendar = USFederalHolidayCalendar()
holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
               '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
               '2016-11-24', '2016-12-26',
               ...
               '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
               '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
               '2030-11-28', '2030-12-25'],
              dtype='datetime64[ns]', length=150, freq=None)

Now we find the indices of the next holiday for each of the original dates using searchsorted:

indices = holidays.searchsorted(dates)
# array([1, 6, 9, 3])
next_nearest = holidays[indices]
# DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)

Then take the difference between the two:

next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
# array([15, 31, 14, 88])

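You'll need to be careful with the indices so you don't wrap around: searchsorted can return len(holidays) when a date falls after the last holiday, and for the previous holiday you do the calculation with indices - 1, which becomes -1 (the last element) when a date falls before the first holiday. Still, it should act as (I hope) a relatively good base.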
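One way to guard those edge cases is to mask out-of-range indices before indexing and fill the gaps with NaN. This is only a sketch under assumptions: the sample data is made up so that the first date falls before every holiday and the last date after every holiday, and the names days_until_next and days_since_prev are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the first and last dates fall outside the holiday range.
holidays = pd.to_datetime(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30'])
dates = pd.to_datetime(['2015-12-25', '2016-01-03', '2016-03-03', '2016-06-15'])

indices = holidays.searchsorted(dates)

# Next holiday: an index equal to len(holidays) means there is none.
has_next = indices < len(holidays)
next_idx = np.where(has_next, indices, 0)  # 0 is a placeholder, masked out below
next_days = pd.to_timedelta(holidays[next_idx].values - dates.values).days
days_until_next = np.where(has_next, next_days, np.nan)

# Previous holiday: indices - 1 would be -1 and silently wrap to the last holiday.
prev_idx = indices - 1
has_prev = prev_idx >= 0
prev_idx = np.where(has_prev, prev_idx, 0)  # placeholder again
prev_days = pd.to_timedelta(dates.values - holidays[prev_idx].values).days
days_since_prev = np.where(has_prev, prev_days, np.nan)

print(days_until_next)
print(days_since_prev)
```

Note that with the default side='left', a date that is itself a holiday counts as its own "next" holiday (0 days); pass side='right' to searchsorted if you want the strictly later one.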


I think the pandarallel package makes this much easier now. I have not looked into it much, but it should do the trick.