
Improve pandas (PyTables?) HDF5 table write performance


Here is a similar comparison I just did. It's about 1/3 of the data (10M rows). The final size is about 1.3 GB.

I define 3 timing functions:

Test the Fixed format (called Storer in 0.12). This writes in a PyTables Array format

def f(df):
    store = pd.HDFStore('test.h5','w')
    store['df'] = df
    store.close()

Write in the Table format, using PyTables Table format. Do not create an index.

def f2(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df,index=False)
    store.close()

Same as f2, but create an index (which is normally done)

def f3(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df)
    store.close()

Create the frame

In [25]: df = concat([DataFrame(np.random.randn(10000000,10)),DataFrame(np.random.randint(0,10,size=50000000).reshape(10000000,5))],axis=1)

In [26]: df
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 4
dtypes: float64(10), int64(5)

v0.12.0

In [27]: %timeit f(df)
1 loops, best of 3: 14.7 s per loop

In [28]: %timeit f2(df)
1 loops, best of 3: 32 s per loop

In [29]: %timeit f3(df)
1 loops, best of 3: 40.1 s per loop

master/v0.13.0

In [5]: %timeit f(df)
1 loops, best of 3: 12.9 s per loop

In [6]: %timeit f2(df)
1 loops, best of 3: 17.5 s per loop

In [7]: %timeit f3(df)
1 loops, best of 3: 24.3 s per loop

Timing Runs with the same file as provided by the OP (link is below)

In [4]: df = pd.read_hdf('test.h5','df')

In [5]: df
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28880943 entries, 0 to 28880942
Columns: 14 entries, node_id to kernel_type
dtypes: float64(4), int64(10)

Like f, the Fixed format

In [6]: %timeit df.to_hdf('test.hdf','df',mode='w')
1 loops, best of 3: 36.2 s per loop

Like f2, Table format, no index

In [7]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False)
1 loops, best of 3: 45 s per loop

In [8]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False,chunksize=2000000)
1 loops, best of 3: 44.5 s per loop

Like f3, Table format with index

In [9]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000)
1 loops, best of 3: 1min 36s per loop

Like f3, Table format with index, compressed with blosc

In [10]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000,complib='blosc')
1 loops, best of 3: 46.5 s per loop

In [11]: %timeit pd.read_hdf('test.hdf','df')
1 loops, best of 3: 10.8 s per loop

Show the original file (test.h5) and the compressed one (test.hdf)

In [13]: !ls -ltr test.h*
-rw-r--r-- 1 jreback users 3471518282 Nov 20 18:20 test.h5
-rw-rw-r-- 1 jreback users  649327780 Nov 20 21:17 test.hdf
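(For scale, and assuming the original test.h5 is uncompressed: 3,471,518,282 / 649,327,780 ≈ 5.3, so the blosc-compressed table is a bit over 5x smaller.)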

Several points to note.

  • Not creating an index can make a non-trivial difference in write time. I also believe that if you have a string-based index it can substantially worsen write time. That said, you always want to create an index to make retrieval very fast (a minimal sketch of this write-now-index-later pattern follows this list).

    You didn't include what your index is, nor whether it's sorted (though I think this only makes a small difference).

  • The write penalty in my examples is roughly 2x (though I have seen it be somewhat bigger when INCLUDING the index creation time). Thus your 7 s (1/2 of my time) for 3x the number of rows I am writing is quite suspect. I am using a reasonably fast disk array. If you were using a flash-based disk, then this is possible, though.

  • master/v0.13.0 (to be released very soon) improves the write times on tables substantially.

  • You can try setting the chunksize parameter to a bigger number when you write the data (its default is 100000). The purpose of the relatively low default is to keep memory usage constant (if it is bigger you will use more memory; in theory it should write faster, though).

  • Tables offer 2 advantages over the Fixed format: 1) query retrieval, and 2) appendability. Reading the entire table doesn't take advantage of either, so if you ONLY want to read the entire table, then the Fixed format is recommended. (In my experience the flexibility of Tables greatly outweighs the write penalty, but YMMV.)
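To make the index point concrete, here is a minimal sketch of the write-now-index-later pattern referenced in the first bullet. The file name, column names, and the optlevel/kind choices are only illustrative, and the string-based where syntax assumes pandas 0.13+:

import numpy as np
import pandas as pd

# small example frame; 'a' is declared a data column so it can be queried on disk
df = pd.DataFrame(np.random.randn(1000000, 3), columns=['a', 'b', 'c'])

store = pd.HDFStore('example.h5', 'w')
store.append('df', df, index=False, data_columns=['a'])   # fast write, no index yet
store.create_table_index('df', optlevel=9, kind='full')   # build the index once, before querying
store.close()

# the Table format then supports on-disk queries (and further appends)
subset = pd.read_hdf('example.h5', 'df', where='a > 0')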

The bottom line is to repeat the timings (use IPython, as it will run multiple tests). If you can reproduce your results, then please post a %prun and I'll take a look (a sketch of how to capture one follows).
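For reference, a sketch of capturing that profile in IPython (using the f3 function defined above; the -l 20 flag just limits the printout to the top 20 entries):

%timeit f3(df)        # repeat the timing first
%prun -l 20 f3(df)    # then profile a single run and post the output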

Update:

So the recommended approach for a table this size is to compress with blosc and use pandas master/0.13.0 along with PyTables 3.0.0.
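A sketch of that recommended call, based on the compressed run above (complevel=9 is my own addition here; the timing above used blosc at its default level):

df.to_hdf('test.hdf', 'df', mode='w', format='table',
          chunksize=2000000, complib='blosc', complevel=9)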


That's an interesting discussion. I think Peter is getting awesome performance for the Fixed format because the format writes in a single shot and also because he has a really good SSD (it can write at more than 450 MB/s).

Appending to a table is a more complex operation (the dataset has to be enlarged, and new records must be checked to ensure that they follow the schema of the table). This is why appending rows to tables is generally slower (but still, Jeff is getting ~70 MB/s, which is pretty good). That Jeff is getting more speed than Peter here is probably because he has a better processor.
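(As a rough sanity check of that figure using the numbers above: 28.9M rows × 14 columns × 8 bytes ≈ 3.2 GB, and the no-index table write took about 45 s, i.e. roughly 72 MB/s.)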

Finally, indexing in PyTables uses a single processor, yes, and that normally is an expensive operation, so you should really disable it if you are not going to query data on-disk.