
Convert large csv to hdf5


Use append=True in the call to to_hdf:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)

print(pd.read_hdf(filename, 'data'))

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to, as the short example below illustrates.
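
For instance, a minimal sketch of the failure mode, reusing filename and df2 from above and writing to a separate hypothetical key 'fixed_data' (the exact exception text may vary by pandas version):

df2.to_hdf(filename, 'fixed_data', mode='w')     # format='fixed' by default
df2.to_hdf(filename, 'fixed_data', append=True)  # raises ValueError: fixed stores are not appendable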

Thus, you can process the CSVs one at a time, using append=True to build up the HDF5 file, then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected; a sketch of this loop follows below.
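
For a single large CSV, here is a minimal sketch of that loop, assuming a hypothetical input path /tmp/big.csv and using read_csv's chunksize parameter to stream the file:

import pandas as pd

filename = '/tmp/test.h5'
csv_path = '/tmp/big.csv'    # hypothetical path to the large CSV

# chunksize makes read_csv yield DataFrames of at most 100000 rows each
for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=100000)):
    if i == 0:
        # first chunk: create the file with an appendable table
        chunk.to_hdf(filename, 'data', mode='w', format='table')
    else:
        # subsequent chunks: append to the existing table
        chunk.to_hdf(filename, 'data', append=True)

If the CSV contains string columns whose width varies between chunks, you may also need to pass min_itemsize so the table reserves enough space for the longest strings.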


Alternatively, instead of calling df.to_hdf, you could append to an HDFStore directly:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90


This should be possible with PyTables, though you'll need to use the EArray class.

As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.

import os

import numpy
import tables

training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc')   # fast compressor at a moderate setting

training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)

for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)

for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))

training_data.close()    # flush and close the HDF5 file

Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a descriptive title; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are strictly required by the signature, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.

Once the array is created, you can use its append method in the expected way.
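
Applied to the original CSV question, here is a rough sketch under the assumption of a purely numeric CSV (the paths, chunk size, and column count are hypothetical):

import pandas as pd
import tables

csv_path = '/tmp/big.csv'    # hypothetical input CSV with 10 numeric columns
n_cols = 10

h5 = tables.open_file('/tmp/big.h5', mode='w')
atom = tables.Float64Atom()
# the 0 in the first dimension marks it as the expandable axis
earray = h5.create_earray(h5.root, 'data', atom, (0, n_cols))

# stream the CSV in chunks and append each chunk's values to the EArray
for chunk in pd.read_csv(csv_path, chunksize=100000):
    earray.append(chunk.to_numpy())

h5.close()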