Pandas HDF5 as a Database

HDF5 works fine for concurrent read-only access.
For concurrent write access you either have to use Parallel HDF5 or have a single worker process that takes care of all writing to the HDF5 store.
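The worker-process option can be sketched as follows. This is a minimal illustration of the single-writer pattern, not library code: `writer_loop`, `worker`, the `(key, frame)` tuple shape, and the `None` sentinel are all assumptions for this sketch; in practice `write` would be bound to `store.append` on an open `pandas.HDFStore`.

```python
import multiprocessing as mp

def writer_loop(queue, write):
    # The only consumer of the queue.  Because exactly one process
    # applies all writes, no file-level locking is needed.  `write`
    # stands in for e.g. `store.append` on an open pandas.HDFStore.
    while True:
        item = queue.get()
        if item is None:            # sentinel: no more writes coming
            break
        key, frame = item
        write(key, frame)           # all writes happen here, serially

def worker(queue, i):
    # Workers never open the HDF5 file themselves; they only enqueue
    # their results for the writer process.
    queue.put(("results", {"value": i}))
```

Workers would run as separate `mp.Process` instances putting results on the queue, while the writer process drains it into the store.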

There are some efforts by the HDF Group itself to combine HDF5 with a RESTful API. See here and here for more details. I am not sure how mature it is.

I recommend using a hybrid approach and exposing it via a RESTful API.
You can store the meta-information in a SQL/NoSQL database and keep the raw data (the time series) in one or more HDF5 files.

That way there is one public REST API to access the data, and the user doesn't have to care what happens behind the curtains.
That's also the approach we are taking for storing biological information.
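A minimal sketch of the metadata side of this hybrid approach, assuming SQLite as the metadata store; the table layout and the function names (`register_series`, `locate_series`) are invented for illustration. The REST layer would look a series up here and then read the raw data with `pandas.read_hdf(path, key)`.

```python
import sqlite3

def register_series(conn, name, hdf_path, hdf_key):
    # Each metadata row records where the raw time series lives:
    # which HDF5 file, and under which key inside it.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS series "
        "(name TEXT PRIMARY KEY, path TEXT, key TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO series VALUES (?, ?, ?)",
        (name, hdf_path, hdf_key),
    )

def locate_series(conn, name):
    # Returns (path, key) for the named series, or None if unknown.
    return conn.execute(
        "SELECT path, key FROM series WHERE name = ?", (name,)
    ).fetchone()
```

The metadata database stays small and queryable, while the bulk numeric data never leaves the HDF5 files.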


I know the following is not a good answer to the question, but it is perfect for my needs, and I didn't find it implemented anywhere else:

from pandas import HDFStore
import os
import time

class SafeHDFStore(HDFStore):
    def __init__(self, *args, **kwargs):
        probe_interval = kwargs.pop("probe_interval", 1)
        self._lock = "%s.lock" % args[0]
        while True:
            try:
                # O_CREAT | O_EXCL makes creation of the lock file
                # atomic: it fails if another process holds the lock
                self._flock = os.open(self._lock, os.O_CREAT |
                                                  os.O_EXCL |
                                                  os.O_WRONLY)
                break
            except FileExistsError:
                time.sleep(probe_interval)

        HDFStore.__init__(self, *args, **kwargs)

    def __exit__(self, *args, **kwargs):
        HDFStore.__exit__(self, *args, **kwargs)
        # release the lock only after the store has been closed
        os.close(self._flock)
        os.remove(self._lock)

I use this as

result = do_long_operations()

with SafeHDFStore('example.hdf') as store:
    # Only put inside this block the code which operates on the store
    store['result'] = result

and different processes/threads working on the same store will simply queue.

Notice that if instead you naively operate on the store from multiple processes, the last one to close the store will "win", and what the others "think they have written" will be lost.
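The mutual exclusion here rests on `os.open` with `os.O_CREAT | os.O_EXCL` being atomic: creating the lock file succeeds for exactly one process, and every other attempt raises `FileExistsError` until the file is removed. A self-contained sketch of just that mechanism (the lock path is a throwaway temporary file, used only for illustration):

```python
import os
import tempfile

lock = os.path.join(tempfile.mkdtemp(), "example.hdf.lock")

# The first open atomically creates the lock file...
fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)

# ...and any second attempt fails until the lock file is removed.
try:
    os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    acquired_twice = True
except FileExistsError:
    acquired_twice = False

# Releasing the lock: close the descriptor, then delete the file.
os.close(fd)
os.remove(lock)
```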

(I know I could instead just let one process manage all writes, but this solution avoids the overhead of pickling)

EDIT: "probe_interval" can now be tuned (one second is too long if writes are frequent).