Convert large csv to hdf5
Use append=True in the call to to_hdf:
    import numpy as np
    import pandas as pd

    filename = '/tmp/test.h5'
    df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])
    print(df)
    #    A  B
    # 0  0  1
    # 1  2  3
    # 2  4  5
    # 3  6  7
    # 4  8  9

    # Save to HDF5
    df.to_hdf(filename, 'data', mode='w', format='table')
    del df    # allow df to be garbage collected

    # Append more data
    df2 = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10, columns=['A', 'B'])
    df2.to_hdf(filename, 'data', append=True)

    print(pd.read_hdf(filename, 'data'))
yields
        A   B
    0   0   1
    1   2   3
    2   4   5
    3   6   7
    4   8   9
    0   0  10
    1  20  30
    2  40  50
    3  60  70
    4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to.
Thus, you can process the CSV files one at a time, using append=True to build up the HDF5 file. After each file, overwrite the DataFrame or use del df so the old DataFrame can be garbage collected.
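Putting this together for the original question, here is a minimal sketch of a chunked CSV-to-HDF5 conversion. The /tmp paths and the two tiny generated CSV files are stand-ins for the real data, and chunksize=1000 is an arbitrary choice:

```python
import os
import numpy as np
import pandas as pd

# Hypothetical setup: write two small CSV files to stand in for the large inputs.
for i in range(2):
    pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i,
                 columns=['A', 'B']).to_csv(f'/tmp/part{i}.csv', index=False)

filename = '/tmp/combined.h5'
if os.path.exists(filename):
    os.remove(filename)  # start fresh so reruns don't double-append

for i in range(2):
    # chunksize keeps memory bounded even for CSVs too large to fit in RAM
    for chunk in pd.read_csv(f'/tmp/part{i}.csv', chunksize=1000):
        chunk.to_hdf(filename, 'data', mode='a', format='table', append=True)

print(pd.read_hdf(filename, 'data').shape)
```

Each chunk is written and then discarded by the loop, so peak memory usage is set by chunksize rather than by the total size of the CSVs.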
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
    import numpy as np
    import pandas as pd

    filename = '/tmp/test.h5'
    store = pd.HDFStore(filename)
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=['A', 'B'])
        store.append('data', df)
    store.close()

    store = pd.HDFStore(filename)
    data = store['data']
    print(data)
    store.close()
yields
        A   B
    0   0   1
    1   2   3
    2   4   5
    3   6   7
    4   8   9
    0   0  10
    1  20  30
    2  40  50
    3  60  70
    4  80  90
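If you take the HDFStore route, using the store as a context manager spares you from forgetting store.close() if an exception is raised mid-loop. This is a sketch of the same loop with a hypothetical /tmp path:

```python
import numpy as np
import pandas as pd

filename = '/tmp/test_store.h5'

# The context manager closes the file even if appending raises an exception
with pd.HDFStore(filename, mode='w') as store:
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i,
                          columns=['A', 'B'])
        store.append('data', df)

with pd.HDFStore(filename, mode='r') as store:
    print(store['data'].shape)
```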
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy
files into a single .h5
file.
    import numpy
    import tables
    import os

    training_data = tables.open_file('nn_training.h5', mode='w')
    a = tables.Float64Atom()
    bl_filter = tables.Filters(5, 'blosc')    # fast compressor at a moderate setting

    training_input = training_data.create_earray(training_data.root, 'X', a, (0, 1323),
                                                 'Training Input', bl_filter, 4000000)
    training_output = training_data.create_earray(training_data.root, 'Y', a, (0, 27),
                                                  'Training Output', bl_filter, 4000000)

    for filename in os.listdir('input'):
        print("loading {}...".format(filename))
        a = numpy.load(os.path.join('input', filename))
        print("writing to h5")
        training_input.append(a)

    for filename in os.listdir('output'):
        print("loading {}...".format(filename))
        training_output.append(numpy.load(os.path.join('output', filename)))

    training_data.close()    # flush and close the file when done
Take a look at the docs for detailed instructions, but very briefly, the create_earray
function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0
in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append
method in the expected way.
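To illustrate just the required pieces, here is a minimal, self-contained sketch (the /tmp path, array name, and shapes are made up) that creates an EArray, appends a few chunks along the expandable dimension, and reads the result back:

```python
import numpy as np
import tables

# Create an EArray extendable along axis 0: shape (0, 3) means rows grow,
# while every appended chunk must have exactly 3 columns.
with tables.open_file('/tmp/earray_demo.h5', mode='w') as f:
    arr = f.create_earray(f.root, 'X', tables.Float64Atom(), (0, 3),
                          'demo array')
    for _ in range(4):
        arr.append(np.ones((5, 3)))   # each append adds 5 rows

# Reopen and check the accumulated shape: 4 chunks of 5 rows = 20 rows
with tables.open_file('/tmp/earray_demo.h5', mode='r') as f:
    print(f.root.X.shape)
```

Using open_file as a context manager plays the same role as the explicit close() above: the file is flushed and closed even if an append fails partway through.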