Resample in a rolling window using pandas
I had a similar issue dealing with a timedelta series where I wanted to take a moving average and then resample. Here is an example where I have 100 seconds of data. I take a rolling average over 10-second windows and then resample every 5 seconds, taking the first entry in each resample bin. The result is the average of the previous 10 seconds at 5-second increments. You could do something similar with months instead of seconds, although rolling only accepts fixed-length offsets (such as '30D'), not calendar months:
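import pandas as pd

# 100 seconds of data: the values 0..99 on a one-second timedelta index
df = pd.DataFrame(range(0, 100), index=pd.to_timedelta(range(0, 100), unit='s'))
# 10-second rolling mean, then keep the first value of every 5-second bin
df.rolling('10s').mean().resample('5s').first()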
Result:
             0
00:00:00   0.0
00:00:05   2.5
00:00:10   5.5
00:00:15  10.5
00:00:20  15.5
00:00:25  20.5
00:00:30  25.5
00:00:35  30.5
00:00:40  35.5
00:00:45  40.5
00:00:50  45.5
00:00:55  50.5
00:01:00  55.5
00:01:05  60.5
00:01:10  65.5
00:01:15  70.5
00:01:20  75.5
00:01:25  80.5
00:01:30  85.5
00:01:35  90.5
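To make the window semantics concrete, here is a small check of one of the values above. It is only a sketch: the column name value and the variable names are my own choices, and it relies on the default behaviour of time-based rolling windows, which are closed on the right and open on the left.

```python
import pandas as pd

# Same data as above, with a named column (the name "value" is an arbitrary choice).
df = pd.DataFrame({"value": range(100)},
                  index=pd.to_timedelta(range(100), unit="s"))

rolled = df["value"].rolling("10s").mean()
result = rolled.resample("5s").first()

# The window ending at t = 10 s covers (0 s, 10 s], i.e. the values 1..10.
manual = df["value"].iloc[1:11].mean()

print(result.loc[pd.Timedelta(seconds=10)], manual)
```

Both numbers should agree with the 00:00:10 row in the result. The first few rows are smaller because their windows are not yet full.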
Here's an attempt - not super clean, but it might work.
Dummy data:
df = pd.DataFrame(data={'a': 1.}, index=pd.date_range(start='2001-1-1', periods=1000))
First, define a function that moves a date back by n months. This needs to be cleaned up, but works for n <= 12.
from datetime import datetime

def decrease_month(date, n):
    # Return the first day of the month that lies n months before date.
    assert n <= 12
    new_month = date.month - n
    year_offset = 0
    if new_month <= 0:
        # Wrapped past January: step back one year.
        year_offset = -1
        new_month = 12 + new_month
    return datetime(date.year + year_offset, new_month, 1)
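One possible cleanup, in case n > 12 is ever needed, is to let pandas do the month arithmetic through monthly periods. This is just a sketch of an alternative, not part of the approach above, and the name decrease_month_any is made up for illustration.

```python
import pandas as pd

def decrease_month_any(date, n):
    """Return the first day of the month n months before date (any n >= 0)."""
    # Convert to a monthly Period, step back n months, convert back to a Timestamp.
    return (pd.Timestamp(date).to_period("M") - n).to_timestamp()

print(decrease_month_any("2001-01-15", 4))
print(decrease_month_any("2001-01-15", 14))
```

It returns a pandas Timestamp rather than a datetime, which should not matter for the grouping done later.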
Then, add 5 new columns for the 5 rolling periods that each date will cross.
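rolling_period = 5  # number of months in the rolling window

for n in range(rolling_period):
    # m_n holds the month start that lies n months before each date
    df['m_' + str(n)] = df.index.map(lambda x: decrease_month(x, n))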
Then use the melt function to convert the data from wide to long, so each rolling period will have one entry.
df_m = pd.melt(df, id_vars='a')
You should be able to group by the newly created column (melt names it value), and each date will represent the right 5-month rolling period.
In [222]: df_m.groupby('value').sum()
Out[222]:
              a
value
2000-09-01   31
2000-10-01   59
2000-11-01   90
2000-12-01  120
2001-01-01  151
2001-02-01  150
2001-03-01  153
2001-04-01  153
2001-05-01  153
2001-06-01  153
2001-07-01  153
...
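For reference, the same idea can be sketched without the row-by-row map, using monthly periods. This is my own variation under the same dummy data, and the names months, long_df, window_start and totals are made up here. Each date is assigned to the 5 month starts it contributes to, and the groupby sums per month start.

```python
import pandas as pd

rolling_period = 5
df = pd.DataFrame(data={"a": 1.0},
                  index=pd.date_range(start="2001-01-01", periods=1000))

months = df.index.to_period("M")

# One copy of the data per shift; window_start is the month n months back.
pieces = [
    pd.DataFrame({"a": df["a"].to_numpy(),
                  "window_start": (months - n).to_timestamp()})
    for n in range(rolling_period)
]
long_df = pd.concat(pieces, ignore_index=True)

totals = long_df.groupby("window_start")["a"].sum()
print(totals.head(7))
```

The first rows should match the output above, for example 31 for 2000-09-01, which only receives the January 2001 dates.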
I have solved a similar problem with the following code:
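import pandas as pd

# data: a DataFrame or Series with a DatetimeIndex
interval = 5
frames = []
for base in range(interval):
    # offset= replaces the base= argument, which was deprecated in pandas 1.1 and removed in 2.0
    frame = data.resample(f"{interval}min", offset=f"{base}min").last()
    frames.append(frame)
pd.concat(frames, axis=0).sort_index()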
Here I create 5 data frames which are resampled at the same interval, but have different offsets (the base parameter, called offset in newer pandas versions). Then I just have to concatenate and sort them. This should usually be much more efficient than rolling + resampling (the only overhead is the sorting).
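A small runnable sketch of this on made-up data, assuming pandas 1.1 or later so that resample accepts offset. The frame data with column x and one row per minute is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data: one row per minute, x = 0..19.
data = pd.DataFrame({"x": range(20)},
                    index=pd.date_range("2021-01-01", periods=20, freq="min"))

interval = 5
# One resampled frame per offset: bins start at minute 0, 1, 2, 3 and 4.
frames = [
    data.resample(f"{interval}min", offset=f"{base}min").last()
    for base in range(interval)
]
combined = pd.concat(frames, axis=0).sort_index()

print(combined.head(10))
```

Each row is labelled with the left edge of its 5-minute bin, so the row at 00:00 holds the last value seen in [00:00, 00:05), the row at 00:01 the last value in [00:01, 00:06), and so on. The shifted offsets also produce partial bins at both ends of the data, which you may want to drop.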