Experience with using h5py to do analytical work on big data in Python?


We use Python together with h5py, NumPy/SciPy and Boost.Python for data analysis. Our typical datasets are up to a few hundred GB in size.

HDF5 advantages:

  • data can be inspected conveniently using the h5view application, h5py/ipython and the h5* command-line tools
  • APIs are available for different platforms and languages
  • data can be structured using groups
  • data can be annotated using attributes
  • data compression is built in and worry-free
  • I/O on single datasets is fast
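
To make the points about groups, attributes, compression and single-dataset I/O concrete, here is a minimal h5py sketch. The file name, group name, attribute names and chunk shape are made up for illustration and are not taken from our actual setup.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, and attribute names, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
samples = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

with h5py.File(path, "w") as f:
    # Groups structure the file much like directories.
    run = f.create_group("run_001")
    dset = run.create_dataset(
        "samples",
        data=samples,
        chunks=(100, 1000),   # compressed datasets are stored in chunks
        compression="gzip",   # built-in, transparent on read
    )
    # Attributes annotate groups and datasets.
    dset.attrs["units"] = "volts"
    run.attrs["operator"] = "alice"

with h5py.File(path, "r") as f:
    dset = f["run_001/samples"]
    units = dset.attrs["units"]
    # Slicing reads only the requested rows, not the whole dataset.
    block = dset[10:20, :]

print(units, block.shape)
```

The partial read at the end is what keeps this workable when a dataset is far larger than RAM: only the chunks that overlap the slice are read and decompressed.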

HDF5 pitfalls:

  • Performance breaks down if an h5 file contains too many datasets/groups (> 1000), because traversing them is very slow. On the other hand, I/O is fast for a few big datasets.
  • Advanced data queries (SQL-like) are clumsy to implement and slow (consider SQLite in that case)
  • HDF5 is not thread-safe in all cases: one has to ensure that the library was compiled with the correct options
  • Modifying h5 datasets (resizing, deleting etc.) inflates the file size in the best case and is impossible in the worst case; the whole h5 file has to be copied to compact it again
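
The last pitfall can be reproduced in a few lines. This is only a sketch with made-up file and dataset names: it deletes a large dataset, measures the file, and then copies the remaining objects into a fresh file to compact it. It copies top-level objects only, so attributes on the root group would have to be carried over separately.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this illustration.
workdir = tempfile.mkdtemp()
bloated_path = os.path.join(workdir, "bloated.h5")
repacked_path = os.path.join(workdir, "repacked.h5")

small = np.arange(1000, dtype=np.float64)

# Write a large dataset first, then a small one behind it.
with h5py.File(bloated_path, "w") as f:
    f.create_dataset("drop", data=np.zeros(1_000_000, dtype=np.float64))
    f.create_dataset("keep", data=small)

# Delete the large dataset in a later session. The link disappears, but the
# space it occupied in the middle of the file is typically not given back.
with h5py.File(bloated_path, "a") as f:
    del f["drop"]

bloated_size = os.path.getsize(bloated_path)

# Compact by copying every remaining top-level object into a new file.
with h5py.File(bloated_path, "r") as src, h5py.File(repacked_path, "w") as dst:
    for name in src:
        src.copy(name, dst)

repacked_size = os.path.getsize(repacked_path)

with h5py.File(repacked_path, "r") as f:
    kept = f["keep"][...]
    remaining = sorted(f.keys())

print(bloated_size, repacked_size)
```

The h5repack command-line tool that ships with HDF5 serves the same purpose without any Python code. Either way the full file is rewritten, which is the expensive part for files of a few hundred GB.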


This is a long comment, not an answer to your actual question about h5py. I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.

Three reasons:

  • you can mine the source code of any of those packages for ideas that might help you generally
  • you might find the package names useful in searching for Python equivalents; a lot of R users are Python users, too
  • under some circumstances, it might prove convenient to just call out to R for a particular analysis using one of those packages and then pull the results back into Python

Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.