Resampling Within a Pandas MultiIndex

python pandas time-series multi-index hierarchical-data

pd.Grouperallows you to specify a "groupby instruction for a target object". Inparticular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1))

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex.Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_valuesresult = (df.groupby([level_values(i) for i in [0,1]]                      +[pd.Grouper(freq='2D', level=-1)]).sum())

It looks a bit awkward, but using_Grouper turns out to be much faster than my originalsuggestion, using_reset_index:

import numpy as npimport pandas as pdimport datetime as DTdef using_Grouper(df):    level_values = df.index.get_level_values    return (df.groupby([level_values(i) for i in [0,1]]                       +[pd.Grouper(freq='2D', level=-1)]).sum())def using_reset_index(df):    df = df.reset_index(level=[0, 1])    return df.groupby(['State','City']).resample('2D').sum()def using_stack(df):    # http://stackoverflow.com/a/15813787/190597    return (df.unstack(level=[0,1])              .resample('2D').sum()              .stack(level=[2,1])              .swaplevel(2,0))def make_orig():    values_a = range(16)    values_b = range(10, 26)    states = ['Georgia']*8 + ['Alabama']*8    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)    df = pd.DataFrame(        {'value_a': values_a, 'value_b': values_b},        index = [states, cities, dates])    df.index.names = ['State', 'City', 'Date']    return dfdef make_df(N):    dates = pd.date_range('2000-1-1', periods=N)    states = np.arange(50)    cities = np.arange(10)    index = pd.MultiIndex.from_product([states, cities, dates],                                        names=['State', 'City', 'Date'])    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,                      columns=['value_a', 'value_b'])    return dfdf = make_orig()print(using_Grouper(df))

yields

                               value_a  value_bState   City       Date                        Alabama Mobile     2012-01-01       17       37                   2012-01-03       21       41        Montgomery 2012-01-01       25       45                   2012-01-03       29       49Georgia Atlanta    2012-01-01        1       21                   2012-01-03        5       25        Savanna    2012-01-01        9       29                   2012-01-03       13       33

Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:

In [30]: df = make_df(10)In [34]: len(df)Out[34]: 5000In [32]: %timeit using_Grouper(df)100 loops, best of 3: 6.03 ms per loopIn [33]: %timeit using_stack(df)10 loops, best of 3: 22.3 ms per loopIn [31]: %timeit using_reset_index(df)1 loop, best of 3: 659 ms per loop

python pandas time-series multi-index hierarchical-data

You need the groupby() method and provide it with a pd.Grouper for each level of your MultiIndex you wish to maintain in the resulting DataFrame. You can then apply an operation of choice.

To resample date or timestamp levels, you need to set the freq argument with the frequency of choice — a similar approach using pd.TimeGrouper() is deprecated in favour of pd.Grouper() with the freq argument set.

This should give you the DataFrame you need:

df.groupby([pd.Grouper(level='State'),             pd.Grouper(level='City'),             pd.Grouper(level='Date', freq='2D')]          ).sum()

The Time Series Guide in the pandas documentation describes resample() as: "a time-based groupby, followed by a reduction method on each of its groups". Hence, using groupby() should technically be the same operation as using .resample() on a DataFrame with a single index.

The same paragraph points to the cookbook section on resampling for more advanced examples, where the 'Grouping using a MultiIndex' entry is highly relevant for this question. Hope that helps.

python pandas time-series multi-index hierarchical-data

An alternative using stack/unstack

df.unstack(level=[0,1]).resample('2D', how='sum').stack(level=[2,1]).swaplevel(2,0)                               value_a  value_bState   City       DateGeorgia Atlanta    2012-01-01        1       21Alabama Mobile     2012-01-01       17       37        Montgomery 2012-01-01       25       45Georgia Savanna    2012-01-01        9       29        Atlanta    2012-01-03        5       25Alabama Mobile     2012-01-03       21       41        Montgomery 2012-01-03       29       49Georgia Savanna    2012-01-03       13       33

Notes:

No idea about performance comparison
Possible pandas bug - stack(level=[2,1]) worked, but stack(level=[1,2]) failed

CodeHunter

Resampling Within a Pandas MultiIndex

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last