Pandas "Group By" Query on Large Data in HDFStore?
Heres a complete example.
import numpy as npimport pandas as pdimport osfname = 'groupby.h5'# create a framedf = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo'], 'B': ['one', 'one', 'one', 'two', 'one', 'one', 'one', 'two', 'two', 'two', 'one'], 'C': ['dull', 'dull', 'shiny', 'dull', 'dull', 'shiny', 'shiny', 'dull', 'shiny', 'shiny', 'shiny'], 'D': np.random.randn(11), 'E': np.random.randn(11), 'F': np.random.randn(11)})# create the store and append, using data_columns where I possibily# could aggregatewith pd.get_store(fname) as store: store.append('df',df,data_columns=['A','B','C']) print "store:\n%s" % store print "\ndf:\n%s" % store['df'] # get the groups groups = store.select_column('df','A').unique() print "\ngroups:%s" % groups # iterate over the groups and apply my operations l = [] for g in groups: grp = store.select('df',where = [ 'A=%s' % g ]) # this is a regular frame, aggregate however you would like l.append(grp[['D','E','F']].sum()) print "\nresult:\n%s" % pd.concat(l, keys = groups)os.remove(fname)
Output
store:<class 'pandas.io.pytables.HDFStore'>File path: groupby.h5/df frame_table (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])df: A B C D E F0 foo one dull -0.815212 -1.195488 -1.3469801 foo one dull -1.111686 -1.814385 -0.9743272 foo one shiny -1.069152 -1.926265 0.3603183 foo two dull -0.472180 0.698369 -1.0070104 bar one dull 1.329867 0.709621 1.8778985 bar one shiny -0.962906 0.489594 -0.6630686 bar one shiny -0.657922 -0.377705 0.0657907 bar two dull -0.172245 1.694245 1.3741898 foo two shiny -0.780877 -2.334895 -2.7474049 foo two shiny -0.257413 0.577804 -0.15931610 foo one shiny 0.737597 1.979373 -0.236070groups:Index([bar, foo], dtype=object)result:bar D -0.463206 E 2.515754 F 2.654810foo D -3.768923 E -4.015488 F -6.110789dtype: float64
Some caveats:
1) This methodology makes sense if your group density is relatively low. On the order of hundreds or thousands of groups. If you get more than that there are more efficient (but more complicated methods), and your function which you are applying (in this case sum
) become more restrictive.
Essentially you would iterator over the entire store by chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine doing a mean, so you would need to keep a running total plus a running count, then divide at the end). So some operations would be a bit trickier, but could potentially handle MANY groups (and is really fast).
2) the efficiency of this could be improved by saving the coordinates (e.g. the group locations, but this is a bit more complicated)
3) multi-grouping is not possible with this scheme (it IS possible, but requires an approach more like 2) above
4) the columns that you want to group, MUST be a data_column!
5) you can combine any other filter you wish in the select btw (which is a sneeky way of doing multi-grouping btw, you just form 2 unique lists of group and iterator over the product of them, not extremely efficient if you have lots of groups, but can work)
HTH
let me know if this works for you