
Query HDF5 in Pandas


Here are the docs for querying on non-index columns.

Create the test data. It is not clear how the original frame is constructed, e.g. whether the data is unique or what the ranges are, so I have created a sample with 10M rows and a multi-level index of the id column and a date range.

In [60]: np.random.seed(1234)

In [62]: pd.set_option('display.max_rows',20)

In [63]: index = pd.MultiIndex.from_product([np.arange(10000,11000),pd.date_range('19800101',periods=10000)],names=['id','date'])

In [67]: df = DataFrame(dict(id2=np.random.randint(0,1000,size=len(index)),w=np.random.randn(len(index))),index=index).reset_index().set_index(['id','date'])

In [68]: df
Out[68]: 
                  id2         w
id    date                     
10000 1980-01-01  712  0.371372
      1980-01-02  718 -1.255708
      1980-01-03  581 -1.182727
      1980-01-04  202 -0.947432
      1980-01-05  493 -0.125346
      1980-01-06  752  0.380210
      1980-01-07  435 -0.444139
      1980-01-08  128 -1.885230
      1980-01-09  425  1.603619
      1980-01-10  449  0.103737
...               ...       ...
10999 2007-05-09    8  0.624532
      2007-05-10  669  0.268340
      2007-05-11  918  0.134816
      2007-05-12  979 -0.769406
      2007-05-13  969 -0.242123
      2007-05-14  950 -0.347884
      2007-05-15   49 -1.284825
      2007-05-16  922 -1.313928
      2007-05-17  347 -0.521352
      2007-05-18  353  0.189717

[10000000 rows x 2 columns]

Write the data to disk, showing how to create a data column (note that the index levels are automatically queryable; passing data_columns=['id2'] makes id2 queryable as well). to_hdf takes care of opening and closing the store; it is de-facto equivalent to opening a store, appending, and closing it yourself, as the sketch after the write below shows.

In order to query a column, it MUST BE A DATA COLUMN or an index of the frame.

In [70]: df.to_hdf('test.h5','df',mode='w',data_columns=['id2'],format='table')

In [71]: !ls -ltr test.h5
-rw-rw-r-- 1 jreback users 430540284 May 26 17:16 test.h5
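For reference, here is a minimal sketch of the explicit-store version this is equivalent to (same frame and kwargs as the to_hdf call above):

import pandas as pd

# open the store, append the frame as a table with id2 as a data column,
# then close it -- this is what to_hdf(..., mode='w', format='table') does in one call
store = pd.HDFStore('test.h5', mode='w')
store.append('df', df, data_columns=['id2'])
store.close()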

Queries

In [80]: ids=[10101,10898]

In [81]: start_date='20010101'

In [82]: end_date='20010301'

You can specify dates as strings (either in-line or as variables); you can also specify Timestamp-like objects.

In [83]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date')
Out[83]: 
                  id2         w
id    date                     
10000 2001-01-02  972 -0.146107
      2001-01-03  954  1.420412
      2001-01-04  567  1.077633
      2001-01-05   87 -0.042838
      2001-01-06   79 -1.791228
      2001-01-07  744  1.110478
      2001-01-08  237 -0.846086
      2001-01-09  998 -0.696369
      2001-01-10  266 -0.595555
      2001-01-11  206 -0.294633
...               ...       ...
10999 2001-02-19  616 -0.745068
      2001-02-20  577 -1.474748
      2001-02-21  990 -1.276891
      2001-02-22  939 -1.369558
      2001-02-23  621 -0.214365
      2001-02-24  396 -0.142100
      2001-02-25  492 -0.204930
      2001-02-26  478  1.839291
      2001-02-27  688  0.291504
      2001-02-28  356 -1.987554

[58000 rows x 2 columns]
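As a minimal sketch of the Timestamp variant (start_ts and end_ts are names introduced here for illustration):

start_ts = pd.Timestamp('2001-01-01')
end_ts = pd.Timestamp('2001-03-01')

# variables referenced in the where string are resolved from the local scope,
# so Timestamp objects work the same way the date strings above do
result = pd.read_hdf('test.h5', 'df', where='date>start_ts & date<end_ts')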

You can use in-line lists

In [84]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids')
Out[84]: 
                  id2         w
id    date                     
10101 2001-01-02  722  1.620553
      2001-01-03  849 -0.603468
      2001-01-04  635 -1.419072
      2001-01-05  331  0.521634
      2001-01-06  730  0.008830
      2001-01-07  706 -1.006412
      2001-01-08   59  1.380005
      2001-01-09  689  0.017830
      2001-01-10  788 -3.090800
      2001-01-11  704 -1.491824
...               ...       ...
10898 2001-02-19  530 -1.031167
      2001-02-20  652 -0.019266
      2001-02-21  472  0.638266
      2001-02-22  540 -1.827251
      2001-02-23  654 -1.020140
      2001-02-24  328 -0.477425
      2001-02-25  871 -0.892684
      2001-02-26  166  0.894118
      2001-02-27  806  0.648240
      2001-02-28  824 -1.051539

[116 rows x 2 columns]

You can also specify boolean expressions

In [85]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids & id2>500 & id2<600')
Out[85]: 
                  id2         w
id    date                     
10101 2001-01-12  534 -0.220692
      2001-01-14  596 -2.225393
      2001-01-16  596  0.956239
      2001-01-30  513 -2.528996
      2001-02-01  572 -1.877398
      2001-02-13  569 -0.940748
      2001-02-14  541  1.035619
      2001-02-21  571 -0.116547
10898 2001-01-16  591  0.082564
      2001-02-06  586  0.470872
      2001-02-10  531 -0.536194
      2001-02-16  586  0.949947
      2001-02-19  530 -1.031167
      2001-02-22  540 -1.827251
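As an aside, here is a sketch of what I would expect if you reference a column that is neither an index level nor a data column (w was not listed in data_columns above, so it is stored but not queryable); in the pandas versions I have used this raises a ValueError:

try:
    # 'w' is a plain column, not a data column, so it cannot appear in a where clause
    pd.read_hdf('test.h5', 'df', where='w>0')
except ValueError as e:
    print(e)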

To answer your actual question, I would do this (there is really not enough information, so I'll assume some reasonable expectations):

  • Don't loop over queries, unless you have a very small number of absolute queries
  • Read the biggest chunk into memory that you can. Usually this is accomplished by selecting out the biggest ranges of data that you need, even if you select MORE data than you actually need.
  • Then subselect using in-memory expressions, which will generally be orders of magnitude faster.
  • List elements are limited to about 30 elements in total (this is currently an implementation limit on the PyTables side). It will work if you specify more, but what will happen is that you will read in a lot of data, which is then reindexed down in memory, so you need to be aware of this.

So for example say that you have 1000 unique ids with 10,000 dates each, as my example demonstrates, and you want to select, say, 200 of those ids over a date range of 1000 days. Selecting on the date range alone reads roughly 1000 ids × 1000 dates = 1M rows (a tenth of the store), and the in-memory id filter then cuts that down to about 200,000 rows.

So in this case I would simply select on the dates and then do the in-memory comparison, something like this:

df = pd.read_hdf('test.h5','df',where='date>=global_start_date & date<=global_end_date')

# sub-select in memory on the 'id' index level
df = df[df.index.get_level_values('id').isin(list_of_ids)]

You might also have dates that change per id. In that case, chunk the ids into groups, this time querying with a list of ids.

Something like this:

output = []

# process the ids in chunks of 30 (the PyTables list-expression limit noted above)
for i in range(0, len(list_of_ids), 30):
    ids = list_of_ids[i:i + 30]

    # placeholders: look up the date bounds that apply to this chunk of ids
    start_date = get_start_date_for_these_ids(ids)
    end_date = get_end_date_for_these_ids(ids)

    where = 'id=ids & date>=start_date & date<=end_date'
    output.append(pd.read_hdf('test.h5', 'df', where=where))

final_result = pd.concat(output)

The basic idea, then, is to select a superset of the data using the criteria you want, sub-selecting in memory so the result fits, while limiting the number of queries you issue (e.g. imagine a query that ends up selecting a single row: if you have to run it 18M times, that is bad).