Convert large csv to hdf5
Use append=True in the call to to_hdf:
    import numpy as np
    import pandas as pd

    filename = '/tmp/test.h5'
    df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])
    print(df)
    #    A  B
    # 0  0  1
    # 1  2  3
    # 2  4  5
    # 3  6  7
    # 4  8  9

    # Save to HDF5
    df.to_hdf(filename, 'data', mode='w', format='table')
    del df    # allow df to be garbage collected

    # Append more data
    df2 = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10, columns=['A', 'B'])
    df2.to_hdf(filename, 'data', append=True)

    print(pd.read_hdf(filename, 'data'))
yields
        A   B
    0   0   1
    1   2   3
    2   4   5
    3   6   7
    4   8   9
    0   0  10
    1  20  30
    2  40  50
    3  60  70
    4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to.
Thus, you can process the CSV files one at a time, using append=True to build up the HDF5 file. After each file, overwrite the DataFrame or use del df so the old DataFrame can be garbage collected.
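Putting this together for the original question, here is a minimal sketch of a chunked CSV-to-HDF5 conversion. The /tmp paths and the two tiny generated CSV files are stand-ins for the real data, and chunksize=1000 is an arbitrary choice:

```python
import os
import numpy as np
import pandas as pd

# Hypothetical setup: write two small CSV files to stand in for the large inputs.
for i in range(2):
    pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i,
                 columns=['A', 'B']).to_csv(f'/tmp/part{i}.csv', index=False)

filename = '/tmp/combined.h5'
if os.path.exists(filename):
    os.remove(filename)  # start fresh so reruns don't double-append

for i in range(2):
    # chunksize keeps memory bounded even for CSVs too large to fit in RAM
    for chunk in pd.read_csv(f'/tmp/part{i}.csv', chunksize=1000):
        chunk.to_hdf(filename, 'data', mode='a', format='table', append=True)

print(pd.read_hdf(filename, 'data').shape)
```

Each chunk is written and then discarded by the loop, so peak memory usage is set by chunksize rather than by the total size of the CSVs.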
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
    import numpy as np
    import pandas as pd

    filename = '/tmp/test.h5'
    store = pd.HDFStore(filename)
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=['A', 'B'])
        store.append('data', df)
    store.close()

    store = pd.HDFStore(filename)
    data = store['data']
    print(data)
    store.close()
yields
        A   B
    0   0   1
    1   2   3
    2   4   5
    3   6   7
    4   8   9
    0   0  10
    1  20  30
    2  40  50
    3  60  70
    4  80  90
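If you take the HDFStore route, using the store as a context manager spares you from forgetting store.close() if an exception is raised mid-loop. This is a sketch of the same loop with a hypothetical /tmp path:

```python
import numpy as np
import pandas as pd

filename = '/tmp/test_store.h5'

# The context manager closes the file even if appending raises an exception
with pd.HDFStore(filename, mode='w') as store:
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i,
                          columns=['A', 'B'])
        store.append('data', df)

with pd.HDFStore(filename, mode='r') as store:
    print(store['data'].shape)
```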
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy
files into a single .h5
file.
    import numpy
    import tables
    import os

    training_data = tables.open_file('nn_training.h5', mode='w')
    a = tables.Float64Atom()
    bl_filter = tables.Filters(5, 'blosc')    # fast compressor at a moderate setting

    training_input = training_data.create_earray(training_data.root, 'X', a, (0, 1323),
                                                 'Training Input', bl_filter, 4000000)
    training_output = training_data.create_earray(training_data.root, 'Y', a, (0, 27),
                                                  'Training Output', bl_filter, 4000000)

    for filename in os.listdir('input'):
        print("loading {}...".format(filename))
        a = numpy.load(os.path.join('input', filename))
        print("writing to h5")
        training_input.append(a)

    for filename in os.listdir('output'):
        print("loading {}...".format(filename))
        training_output.append(numpy.load(os.path.join('output', filename)))

    training_data.close()    # flush and close the file when done
Take a look at the docs for detailed instructions, but very briefly, the create_earray
function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0
in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append
method in the expected way.
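To illustrate just the required pieces, here is a minimal, self-contained sketch (the /tmp path, array name, and shapes are made up) that creates an EArray, appends a few chunks along the expandable dimension, and reads the result back:

```python
import numpy as np
import tables

# Create an EArray extendable along axis 0: shape (0, 3) means rows grow,
# while every appended chunk must have exactly 3 columns.
with tables.open_file('/tmp/earray_demo.h5', mode='w') as f:
    arr = f.create_earray(f.root, 'X', tables.Float64Atom(), (0, 3),
                          'demo array')
    for _ in range(4):
        arr.append(np.ones((5, 3)))   # each append adds 5 rows

# Reopen and check the accumulated shape: 4 chunks of 5 rows = 20 rows
with tables.open_file('/tmp/earray_demo.h5', mode='r') as f:
    print(f.root.X.shape)
```

Using open_file as a context manager plays the same role as the explicit close() above: the file is flushed and closed even if an append fails partway through.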