
Improve Query Performance From a Large HDFStore Table with Pandas


Here are some recommendations, and a similar question is here.

Use compression: see here. You should try this; it could make things faster or slower depending on exactly what you are querying, YMMV.

ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5
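If you would rather not repack after the fact, you can also enable the same compression when writing from pandas directly. A sketch, assuming df is the frame you are writing; the file name and key are placeholders:

import pandas as pd

# write the table compressed from the start (blosc, level 9, same as the ptrepack call)
df.to_hdf('out.h5', key='df', mode='w', format='table',
          data_columns=['c_id', 'f_id'],
          complevel=9, complib='blosc')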

Use a hierarchical query in chunks. What I mean is this: since you have a relatively small number of c_id and f_id values that you care about, structure the queries something like the following. This is essentially like using isin.

import pandas as pd

# store is your already-open HDFStore, e.g. store = pd.HDFStore('out.h5')

f_ids = [...]  # the list of f_ids that I care about
c_ids = [...]  # the list of c_ids that I care about

def create_batches(l, maxn=32):
    """ create a list of batches, maxed at maxn """
    batches = []
    while True:
        if len(l) <= maxn:
            if len(l) > 0:
                batches.append(l)
            break
        batches.append(l[0:maxn])
        l = l[maxn:]
    return batches

results = []
for f_id_batch in create_batches(f_ids):
    for c_id_batch in create_batches(c_ids):

        q = "f_id={f_id} & c_id={c_id}".format(
                f_id=f_id_batch,
                c_id=c_id_batch)

        # you can include the max/min times in here as well (they would be the max/min
        # times for ALL the included batches though; maybe that is easy for you to compute)

        result = store.select('df', where=q)

        # sub-process this result
        def f(x):
            # you will need to filter out the min/max timestamps here (which I gather
            # are somewhat dependent on the f_id/c_id group)

            # process the data and return something;
            # you could do something like ``return x.describe()`` for simple stats
            return x.describe()

        results.append(result.groupby(['f_id', 'c_id']).apply(f))

results = pd.concat(results)

The key here is to batch so that the isin does NOT have more than 32 members for any variable that you are querying on. This is an internal numpy/pytables limitation. If you exceed it, the query will still work, but it will drop that variable and do a reindex on ALL the data (which is NOT what you want here).
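A quick way to guard against that, reusing create_batches from above (this check is my own addition, not something pytables provides):

# make sure no batch handed to the query exceeds the 32-member limit described above
def check_batches(batches, limit=32):
    for b in batches:
        assert len(b) <= limit, "a batch of %d members would exceed the isin limit" % len(b)

check_batches(create_batches(f_ids))
check_batches(create_batches(c_ids))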

This way you will have a nice subset of the data in memory over just a few loops. Each of these queries should take about the same time as your current queries, but you will have far fewer of them.

The query time is roughly constant for a given subset (unless the data is ordered such that it is completely indexed).

So the query scans 'blocks' of data (which is what the indexes point to). If you have lots of hits across many blocks then the query is slower.
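If you can afford a one-time rewrite of the file, one way to get that ordering is to sort by the queried columns before writing, so the hits for a given f_id/c_id land in contiguous blocks. A sketch; the file name and key are placeholders:

# sort by the data columns being queried so hits are contiguous on disk
df_sorted = df.sort_values(['f_id', 'c_id'])
df_sorted.to_hdf('test_sorted.h5', key='df', mode='w',
                 format='table', data_columns=['c_id', 'f_id'])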

Here's an example

In [5]: N = 100000000

In [6]: df = DataFrame(np.random.randn(N,3),columns=['A','B','C'])

In [7]: df['c_id'] = np.random.randint(0,10,size=N)

In [8]: df['f_id'] = np.random.randint(0,10,size=N)

In [9]: df.index = date_range('20130101',periods=N,freq='s')

In [10]: df.to_hdf('test2.h5','df',mode='w',data_columns=['c_id','f_id'])

In [11]: df.head()
Out[11]:
                            A         B         C  c_id  f_id
2013-01-01 00:00:00  0.037287  1.153534  0.639669     8     7
2013-01-01 00:00:01  1.741046  0.459821  0.194282     8     3
2013-01-01 00:00:02 -2.273919 -0.141789  0.770567     1     1
2013-01-01 00:00:03  0.320879 -0.108426 -1.310302     8     6
2013-01-01 00:00:04 -1.445810 -0.777090 -0.148362     5     5
2013-01-01 00:00:05  1.608211  0.069196  0.025021     3     6
2013-01-01 00:00:06 -0.561690  0.613579  1.071438     8     2
2013-01-01 00:00:07  1.795043 -0.661966  1.210714     0     0
2013-01-01 00:00:08  0.176347 -0.461176  1.624514     3     6
2013-01-01 00:00:09 -1.084537  1.941610 -1.423559     9     1
2013-01-01 00:00:10 -0.101036  0.925010 -0.809951     0     9
2013-01-01 00:00:11 -1.185520  0.968519  2.871983     7     5
2013-01-01 00:00:12 -1.089267 -0.333969 -0.665014     3     6
2013-01-01 00:00:13  0.544427  0.130439  0.423749     5     7
2013-01-01 00:00:14  0.112216  0.404801 -0.061730     5     4
2013-01-01 00:00:15 -1.349838 -0.639435  0.993495     0     9

In [2]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1] & c_id=[2]")
1 loops, best of 3: 13.9 s per loop

In [3]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2] & c_id=[1,2]")
1 loops, best of 3: 21.2 s per loop

In [4]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2,3] & c_id=[1,2,3]")
1 loops, best of 3: 42.8 s per loop

This particular example is 5GB uncompressed and 2.9GB compressed. These results are on the compressed data. In THIS case it is actually quite a bit faster to use the uncompressed file (e.g. the first loop took 3.5s). This is 100MM rows.

So with the last example (In [4]) you are getting 9x the data of the first in a little over 3x the query time.

However, your speedup should be MUCH bigger, because you won't be selecting on individual timestamps in the query; rather, you'll do that filtering later, in memory.
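In other words, select on the data columns only and trim the timestamps afterwards; something like this sketch, where t_min/t_max are placeholders for your per-group bounds:

# query on the data columns only, then trim the timestamps in memory
result = store.select('df', where="f_id=[1] & c_id=[2]")
subset = result.loc[(result.index >= t_min) & (result.index <= t_max)]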

This whole approach assumes that you have enough main memory to hold the results at these batch sizes (i.e. each batch query selects a relatively small part of the set).
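If you are unsure whether a batch will fit, you can check the footprint of a single one before running the full loop (a sketch; the query values are just examples):

# estimate the in-memory size of one batch
sample = store.select('df', where="f_id=[1] & c_id=[2]")
print("one batch uses about %.2f GB" % (sample.memory_usage(deep=True).sum() / 1e9))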