HDF5 taking more space than CSV?


Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651

Your sample is really too small. HDF5 has a fair amount of overhead at really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are much more efficiently represented in binary than as text.
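To make the binary-versus-text point concrete, here is a minimal sketch using only the standard library. It is not how pandas or HDF5 write data; the helper name `text_and_binary_sizes` is made up for illustration, and `repr()` plus one separator byte is only an approximation of what a CSV writer emits per value.

```python
import random
import struct


def text_and_binary_sizes(values):
    """Return (bytes as CSV-style text, bytes as packed float64)."""
    # repr() yields the shortest string that round-trips the float;
    # add 1 byte per value for a separator (comma or newline).
    text_bytes = sum(len(repr(v)) + 1 for v in values)
    # '<d' is a little-endian IEEE 754 double: always 8 bytes per value.
    binary_bytes = len(struct.pack(f"<{len(values)}d", *values))
    return text_bytes, binary_bytes


rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

text_bytes, binary_bytes = text_and_binary_sizes(values)
print(f"text:   {text_bytes} bytes ({text_bytes / len(values):.1f} per value)")
print(f"binary: {binary_bytes} bytes ({binary_bytes / len(values):.1f} per value)")
```

A full-precision float takes well over 8 characters as text, so the text form comes out noticeably larger once the fixed file overhead stops dominating.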

In addition, HDF5 is row based. You get MUCH more efficiency from tables that are not too wide but are fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case.)

I routinely have tables that are 10M+ rows, and query times can be in the ms. Even the example below is small. Having 10GB+ files is quite common (not to mention the astronomy guys, for whom 10GB+ is a few seconds!)

    -rw-rw-r--  1 jreback users 203200986 May 19 20:58 test.csv
    -rw-rw-r--  1 jreback users  88007312 May 19 20:59 test.h5

    In [1]: df = DataFrame(randn(1000000,10))

    In [9]: df
    Out[9]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000000 entries, 0 to 999999
    Data columns (total 10 columns):
    0    1000000  non-null values
    1    1000000  non-null values
    2    1000000  non-null values
    3    1000000  non-null values
    4    1000000  non-null values
    5    1000000  non-null values
    6    1000000  non-null values
    7    1000000  non-null values
    8    1000000  non-null values
    9    1000000  non-null values
    dtypes: float64(10)

    In [5]: %timeit df.to_csv('test.csv',mode='w')
    1 loops, best of 3: 12.7 s per loop

    In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
    1 loops, best of 3: 825 ms per loop

    In [7]: %timeit pd.read_csv('test.csv',index_col=0)
    1 loops, best of 3: 2.35 s per loop

    In [8]: %timeit pd.read_hdf('test.h5','df')
    10 loops, best of 3: 38 ms per loop
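The file sizes in the listing above can be turned into bytes per value with a little arithmetic. This is a back-of-the-envelope sketch: the row and column counts and both file sizes come from the listing, while the assumption that the HDF5 file also stores the 1,000,000-entry index as 8-byte integers is mine and is marked as such in the code.

```python
ROWS, COLS = 1_000_000, 10
CSV_BYTES = 203_200_986  # size of test.csv from the listing
HDF_BYTES = 88_007_312   # size of test.h5 from the listing

n_values = ROWS * COLS

# Average on-disk cost of one float in each format. The CSV figure
# also absorbs the index column and the separators.
csv_per_value = CSV_BYTES / n_values
hdf_per_value = HDF_BYTES / n_values

# Raw float64 payload: 8 bytes per value.
raw_data_bytes = n_values * 8
# ASSUMPTION: the index is stored as one int64 (8 bytes) per row.
raw_index_bytes = ROWS * 8
hdf_overhead = HDF_BYTES - raw_data_bytes - raw_index_bytes

print(f"CSV:  {csv_per_value:.2f} bytes per value")
print(f"HDF5: {hdf_per_value:.2f} bytes per value")
print(f"HDF5 bytes beyond raw data and index: {hdf_overhead}")
```

Under that assumption, the HDF5 file is almost entirely raw data, so the fixed overhead only matters when the dataset itself is tiny.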

I really wouldn't worry about the size (I suspect you are not, but are merely interested, which is fine). The point of HDF5 is that disk is cheap and CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.
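As a rough illustration of the chunking idea, the sketch below writes a frame in pandas' `table` format and then reads it back a slice at a time with `HDFStore.select(..., chunksize=...)`. It assumes PyTables is installed; the helper `chunked_column_means`, the column names, and the sizes are hypothetical choices made for the example, not anything from the original benchmark.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def chunked_column_means(path, key, chunksize):
    """Compute per-column means while holding only one chunk in memory."""
    total = None
    count = 0
    with pd.HDFStore(path, mode="r") as store:
        # With chunksize set, select() returns an iterator of DataFrames.
        for chunk in store.select(key, chunksize=chunksize):
            chunk_sum = chunk.sum()
            total = chunk_sum if total is None else total + chunk_sum
            count += len(chunk)
    return total / count


# Hypothetical data: 100k rows, 10 float columns named c0..c9.
df = pd.DataFrame(
    np.random.randn(100_000, 10),
    columns=[f"c{i}" for i in range(10)],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.h5")
    # format="table" is required for chunked (and queryable) reads.
    df.to_hdf(path, key="df", mode="w", format="table")
    means = chunked_column_means(path, "df", chunksize=10_000)

print(means)
```

The same pattern scales to files far larger than memory, since only `chunksize` rows are materialized at any moment.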