Faster way to groupby time of day in pandas
Both your "lambda version" and the `time` property introduced in version 0.11 are slow in 0.11.0:

```
In [4]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 11.8 s per loop

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
Exception RuntimeError: 'maximum recursion depth exceeded while calling a Python object' in <type 'exceptions.RuntimeError'> ignored
1 loops, best of 3: 11.8 s per loop
```
With the current master both methods are considerably faster:
```
In [1]: pd.version.version
Out[1]: '0.11.1.dev-06cd915'

In [5]: %timeit all_data.groupby(lambda x: x.time()).mean()
1 loops, best of 3: 215 ms per loop

In [6]: %timeit all_data.groupby(all_data.index.time).mean()
10 loops, best of 3: 113 ms per loop
```
So you can either update to master or wait for 0.11.1, which should be released this month.
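For reference, a minimal sketch of the approach being timed. The `all_data` frame here is a hypothetical stand-in (the question's actual data isn't shown): two days of minute-frequency data, so each time of day appears twice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for all_data: two days of minute-frequency data.
idx = pd.date_range("2013-05-01", periods=2 * 24 * 60, freq="min")
all_data = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Group every row by its time of day and average across days.
by_time = all_data.groupby(all_data.index.time).mean()
print(by_time.shape)  # one row per distinct time of day: (1440, 1)
```

The index of `by_time` is made of `datetime.time` objects, one per distinct time of day.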
It's faster to group by the hour/minute/... attributes rather than by `.time`. Here's Jeff's baseline:

```
In [11]: %timeit all_data.groupby(all_data.index.time).mean()
1 loops, best of 3: 202 ms per loop
```
and without `time` it's much faster (the fewer attributes, the faster it is):

```
In [12]: %timeit all_data.groupby(all_data.index.hour).mean()
100 loops, best of 3: 5.53 ms per loop

In [13]: %timeit all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()
10 loops, best of 3: 20.8 ms per loop
```
Note: `time` objects don't accept nanoseconds (even though that's DatetimeIndex's resolution).
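To see why nanoseconds drop out, note that `datetime.time` only goes down to microseconds; its positional fields are hour, minute, second, and microsecond (plus `tzinfo`), and there is no nanosecond slot:

```python
import datetime

# datetime.time stops at microsecond precision.
t = datetime.time(13, 45, 30, 123456)
print(t.microsecond)  # 123456

try:
    # A sixth positional time field (a would-be nanosecond) is rejected.
    datetime.time(13, 45, 30, 123456, None, 789)
except TypeError as e:
    print("no nanosecond field:", e)
```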
To make the comparison fair, we should convert the index back to `time` objects:

```
In [21]: res = all_data.groupby([all_data.index.hour, all_data.index.minute, all_data.index.second, all_data.index.microsecond]).mean()

In [22]: %timeit res.index.map(lambda t: datetime.time(*t))
1000 loops, best of 3: 1.39 ms per loop

In [23]: res.index = res.index.map(lambda t: datetime.time(*t))
```
So it's around 10 times faster at maximum resolution, and you can easily make it coarser (and faster), e.g. by grouping by just the hour and minute.
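The coarser variant can be sketched end to end as follows (again using a hypothetical two-day, minute-frequency `all_data`): group by hour and minute only, then map the resulting `(hour, minute)` MultiIndex back to `time` objects.

```python
import datetime
import numpy as np
import pandas as pd

# Hypothetical sample frame standing in for all_data.
idx = pd.date_range("2013-05-01", periods=2 * 24 * 60, freq="min")
all_data = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# Coarser (and faster) grouping: hour and minute only.
res = all_data.groupby([all_data.index.hour, all_data.index.minute]).mean()

# Convert the (hour, minute) MultiIndex back into time objects.
res.index = res.index.map(lambda t: datetime.time(*t))
print(res.index[0], res.iloc[0, 0])  # datetime.time(0, 0) and the 00:00 mean
```

Dropping the second and microsecond levels shrinks the number of group keys, which is exactly why the coarser groupby runs faster.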