Pytables/Pandas: Combining (reading?) multiple HDF5 stores split by rows


I use a very similar split-process-combine method: multiple processes create intermediate files, and then a single process merges the resulting files. Here are some tips to get better performance:

  • Turn off indexing while you are writing the files by passing index=False (see the docs). I believe that PyTables incrementally updates the index as rows are appended, which in this case is completely unnecessary (as you are going to merge the files afterwards). Index only the final file; this should speed up the writing quite a bit. (A minimal write sketch follows this list.)

  • You might consider changing the default indexing scheme / level, depending on what your queries are (assuming you follow the advice a few points below and do NOT create too many data columns).

  • In a similar vein, don't create a compressed file when writing the pre-merge files; rather, compress AFTER the indexed file is written (in an uncompressed state), so that this ends up being your final step (see the ptrepack sketch after this list, and the docs). Furthermore, it is very important to pass --chunkshape=auto when using ptrepack, which recomputes the PyTables chunksize (i.e. how much data is read/written in a single block) so that it takes the entire table into account.

  • Re compression, YMMV here, depending on how well your data actually compresses and what kinds of queries you are doing. I have some types of data that are faster NOT to compress at all, even though in theory compression should be better. You just have to experiment (though I always use blosc). Blosc effectively has only one compression level (it's either on for levels 1-9 or off for level 0), so changing the level will not change anything.

  • I merge the files in the indexed order, basically by reading a subset of the pre-merge files into memory (a constant number of files, so only a constant amount of memory is used), then appending them one-by-one to the final file (see the merge sketch after this list). I'm not 100% sure this makes a difference, but it seems to work well.

  • You will find that the vast majority of your time is spent creating the index.

  • Furthermore, only index the columns that you actually need, by specifying data_columns=a_small_subset_of_columns when writing each file.

  • I find that writing a lot of smallish files and then merging them into a largish file works better than writing a few large files, but YMMV here (e.g. say 100 pre-merge files of 100MB each to yield a 10GB file, rather than 5 files of 2GB each). Though this may be a function of my processing pipeline, as I tend to bottleneck on the processing rather than the actual writing.

  • I have not used one, but I hear amazing things about using an SSD (solid-state drive) for this kind of thing, even a relatively small one. You can get an order-of-magnitude speedup using one (and compression may change this result).
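
To make the write step concrete, here is a minimal sketch. This is not your actual pipeline: the "data" key, the file path, and the ["id", "timestamp"] data columns are hypothetical placeholders for whatever you really use.

```python
import pandas as pd

def write_premerge_file(df, path):
    # One uncompressed, unindexed pre-merge file per worker chunk.
    with pd.HDFStore(path, mode="w") as store:
        store.append(
            "data",                            # hypothetical key name
            df,
            index=False,                       # skip incremental index updates while writing
            data_columns=["id", "timestamp"],  # hypothetical: only the columns you query on
        )
```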
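
And a sketch of the merge-then-index step, assuming premerge_paths is already sorted into the order you want in the final file and that batch_size files fit comfortably in memory (again, key and column names are placeholders):

```python
import pandas as pd

def merge_premerge_files(premerge_paths, merged_path, batch_size=10):
    # premerge_paths is assumed to already be in the desired (indexed) order.
    with pd.HDFStore(merged_path, mode="w") as out:
        for start in range(0, len(premerge_paths), batch_size):
            # Read a constant-sized batch so memory use stays bounded.
            batch = [pd.read_hdf(p, "data")
                     for p in premerge_paths[start:start + batch_size]]
            for frame in batch:
                out.append("data", frame, index=False,
                           data_columns=["id", "timestamp"])
        # Index only the final file, and only the columns you actually query.
        out.create_table_index("data", columns=["id", "timestamp"],
                               optlevel=9, kind="full")
```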
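
Finally, the compression step via ptrepack, run on the already-indexed, uncompressed file. The file names are placeholders; --propindexes carries the existing indexes into the compressed copy, and with blosc the exact complevel doesn't matter beyond being non-zero, per the point above.

```python
import subprocess

# Recompress the indexed, uncompressed file with blosc, letting ptrepack
# recompute the chunkshape for the whole table. File names are placeholders.
subprocess.run(
    [
        "ptrepack",
        "--chunkshape=auto",
        "--propindexes",
        "--complib=blosc",
        "--complevel=9",
        "merged_uncompressed.h5:/",
        "merged_compressed.h5:/",
    ],
    check=True,
)
```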