Pandas: resample timeseries with groupby
In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead of pd.TimeGrouper. The syntax is largely the same, but pd.TimeGrouper is now deprecated in favor of pd.Grouper.
Moreover, while pd.TimeGrouper could only group by a DatetimeIndex, pd.Grouper can also group by datetime columns, which you specify through the key parameter.
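As a minimal sketch of the key parameter, assume the timestamps live in an ordinary column instead of the index. The column name "Time" and the sample rows are assumptions made for illustration, and the lowercase "h" alias is used because newer pandas releases prefer it over "H".

```python
import pandas as pd

# Hypothetical sample data: the timestamps are a regular column named "Time"
# (an assumed name), not the index.
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2014-08-25 21:00:00",
        "2014-08-25 21:04:00",
        "2014-08-25 22:07:00",
        "2014-08-25 22:09:00",
    ]),
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
})

# key= tells pd.Grouper which datetime column to bin, so no DatetimeIndex is needed.
counts = df.groupby([pd.Grouper(key="Time", freq="h"), "Location"])["Event"].count()
print(counts)
```

The rest of the recipe (unstack, fillna) should apply unchanged to the resulting Series.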
You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
Use count to count the number of events in each group:
grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64
Use unstack to move the Location index level to a column level:
grouper['Event'].count().unstack()
# Out[49]:
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00 NaN    2
and then use fillna to change the NaNs into zeros.
Putting it all together,
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)
yields
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
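The snippets above assume df already exists. A self-contained sketch follows, borrowing the sample rows from the Multiple Column Group By answer further down. The lowercase "h" alias is my substitution for '1H', since newer pandas releases prefer it.

```python
import pandas as pd

# Sample data with a DatetimeIndex (rows taken from the later answer in this thread).
times = pd.to_datetime([
    "2014-08-25 21:00:00",
    "2014-08-25 21:04:00",
    "2014-08-25 22:07:00",
    "2014-08-25 22:09:00",
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
}, index=times)

# Bin the index by hour and group by Location in one step.
grouper = df.groupby([pd.Grouper(freq="h"), "Location"])

# Count events, pivot Location into columns, and replace missing combinations with 0.
result = grouper["Event"].count().unstack("Location").fillna(0)
print(result)
```

Note that fillna leaves the counts as floats here. Appending .astype(int) would turn them back into integers.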
Pandas 0.21 answer: TimeGrouper is getting deprecated
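There are two options for doing this, and they can give different results depending on your data. The first option groups by Location and then resamples by hour within each Location, so empty hours between a location's first and last events show up with a count of zero. The second option groups by Location and hour at the same time, so only the Location/hour combinations that actually occur in the data appear.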
Option 1: Use groupby + resample
grouped = df.groupby('Location').resample('H')['Event'].count()
Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)
grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
For the sample data, they both result in the following:
Location
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64
And then reshape:
grouped.unstack('Location', fill_value=0)
Will output
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
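To illustrate how the two options can diverge, here is a small sketch with made-up data in which LDN has no events during the 22:00 hour. The rows are hypothetical, and "h" is used as the hourly alias.

```python
import pandas as pd

# Hypothetical data: LDN has events at 21:04 and 23:10, but nothing in the 22:00 hour.
times = pd.to_datetime([
    "2014-08-25 21:00:00",
    "2014-08-25 21:04:00",
    "2014-08-25 23:10:00",
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz"],
}, index=times)

# Option 1: resample within each Location; the empty 22:00 bin for LDN is kept.
opt1 = df.groupby("Location").resample("h")["Event"].count()

# Option 2: group by Location and hour together; only observed combinations appear.
opt2 = df.groupby(["Location", pd.Grouper(freq="h")])["Event"].count()

print(opt1)
print(opt2)
```

In this sketch Option 1 should contain an extra LDN row for 22:00 with a count of 0, while Option 2 omits that hour entirely. After unstack('Location', fill_value=0), that difference decides whether the 22:00 row appears in the final table at all.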
Multiple Column Group By
unutbu is spot on with his answer, but I wanted to add what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this, and I thought I would share it for future users.
Create a DataFrame with a Cost column:
In[1]:
import pandas as pd
import numpy as np

times = pd.to_datetime([
    "2014-08-25 21:00:00",
    "2014-08-25 21:04:00",
    "2014-08-25 22:07:00",
    "2014-08-25 22:09:00"
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
    "Cost": [20, 24, 34, 52]
}, index = times)
df

Out[1]:
                    Location Event  Cost
2014-08-25 21:00:00       HK   foo    20
2014-08-25 21:04:00      LDN   bar    24
2014-08-25 22:07:00      LDN   baz    34
2014-08-25 22:09:00      LDN   qux    52
Now we group by using the agg function to specify each column's aggregation method, e.g. count, mean, sum, etc.
In[2]:
grp = df.groupby([pd.Grouper(freq = "1H"), "Location"]) \
        .agg({"Event": np.size, "Cost": np.mean})
grp

Out[2]:
                              Event  Cost
                    Location
2014-08-25 21:00:00 HK            1    20
                    LDN           1    24
2014-08-25 22:00:00 LDN           2    43
Then do the final unstack, fill the NaN values with zeros, and display as int because it's nice.
In[3]:
grp.unstack().fillna(0).astype(int)

Out[3]:
                    Event     Cost
Location               HK LDN   HK LDN
2014-08-25 21:00:00     1   1   20  24
2014-08-25 22:00:00     0   2    0  43
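As a variant of the steps above, the aggregations can be named with strings instead of NumPy callables. This is a sketch and not the original author's code: as far as I know, recent pandas releases emit a FutureWarning when np.mean is passed to agg, and "count" is my substitution for np.size (the two agree here because Event has no missing values).

```python
import pandas as pd

times = pd.to_datetime([
    "2014-08-25 21:00:00",
    "2014-08-25 21:04:00",
    "2014-08-25 22:07:00",
    "2014-08-25 22:09:00",
])
df = pd.DataFrame({
    "Location": ["HK", "LDN", "LDN", "LDN"],
    "Event": ["foo", "bar", "baz", "qux"],
    "Cost": [20, 24, 34, 52],
}, index=times)

# One aggregation per column, given by name: count the events, average the cost.
grp = (
    df.groupby([pd.Grouper(freq="h"), "Location"])
      .agg({"Event": "count", "Cost": "mean"})
)

# Pivot Location into columns, fill missing combinations with 0, and cast to int.
wide = grp.unstack().fillna(0).astype(int)
print(wide)
```

Be aware that astype(int) truncates any fractional means. It is harmless for this sample, but the cast should be dropped if the averages need to keep their decimals.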