Fastest save and load options for a numpy array


For really big arrays, I've heard about several solutions, and they mostly rely on being lazy with the I/O:

  • numpy.memmap, which maps big arrays to binary form on disk
    • Pros:
      • No dependency other than NumPy
      • Transparent replacement of ndarray (any class accepting an ndarray accepts a memmap)
    • Cons:
      • Chunks of your array are limited to 2.5G
      • Still limited by NumPy throughput
  • Use Python bindings for HDF5, a bigdata-ready file format, like PyTables or h5py

    • Pros:
      • Format supports compression, indexing, and other super nice features
      • Apparently the ultimate petabyte-scale file format
    • Cons:
      • Learning curve of having a hierarchical format?
      • Have to define what your performance needs are (see later)
  • Python's pickling system (out of the race, mentioned for Pythonicity rather than speed)

    • Pros:
      • It's Pythonic! (haha)
      • Supports all sorts of objects
    • Cons:
      • Probably slower than the others (because it is aimed at arbitrary objects, not arrays); a minimal round-trip is sketched after this list
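
A minimal pickle round-trip for a NumPy array, as a rough sketch (the file name and the protocol choice are my own examples, not from the discussion above):

import pickle
import numpy as np

data = np.random.random((1000, 1000))

# Dump; HIGHEST_PROTOCOL picks the newest binary protocol, which copes better with large objects.
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back.
with open('data.pkl', 'rb') as f:
    data2 = pickle.load(f)

assert (data == data2).all()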

Numpy.memmap

From the docs of numpy.memmap:

Create a memory-map to an array stored in a binary file on disk.

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory

The memmap object can be used anywhere an ndarray is accepted. Given any memmap fp, isinstance(fp, numpy.ndarray) returns True.
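
A minimal usage sketch (the file name, dtype and shape here are arbitrary examples, not from the docs):

import numpy as np

# Create a disk-backed 10000 x 10000 float64 array (~800 MB file) without allocating it in RAM.
fp = np.memmap('big.dat', dtype='float64', mode='w+', shape=(10000, 10000))

# Work on it like a normal ndarray; only the touched pages are pulled into memory.
fp[:100, :] = np.random.random((100, 10000))
fp.flush()  # push pending changes to disk

# Later (or in another process): reopen read-only without loading the whole file.
fp2 = np.memmap('big.dat', dtype='float64', mode='r', shape=(10000, 10000))
print(fp2[:100, :5])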


HDF5 arrays

From the h5py docs:

Lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

The format supports compression of the data in various ways (more bits loaded for the same I/O read). This makes the data less easy to query individually, but in your case (purely loading / dumping arrays) it might be efficient.
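
For the plain load/dump case, the basic h5py calls look roughly like the sketch below (file and dataset names are arbitrary; compression is opt-in per dataset):

import h5py
import numpy as np

data = np.random.random((1000, 1000))

# Write: one file can hold many named datasets.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=data)
    # Optional chunked + compressed storage (mainly pays off for non-random data).
    f.create_dataset('data_gzip', data=data, compression='gzip', compression_opts=4)

# Read: slicing only fetches the requested part from disk.
with h5py.File('data.h5', 'r') as f:
    first_rows = f['data'][:10]   # partial read
    data2 = f['data'][()]         # full array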


Here is a comparison with PyTables.

I cannot get up to a (int(1e3), int(1e6)) array due to memory restrictions. Therefore, I used a smaller array:

data = np.random.random((int(1e3), int(1e5)))

NumPy save:

%timeit np.save('array.npy', data)

1 loops, best of 3: 4.26 s per loop

NumPy load:

%timeit data2 = np.load('array.npy')

1 loops, best of 3: 3.43 s per loop

PyTables writing:

%%timeit
with tables.open_file('array.tbl', 'w') as h5_file:
    h5_file.create_array('/', 'data', data)

1 loops, best of 3: 4.16 s per loop

PyTables reading:

%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    data2 = h5_file.root.data.read()

1 loops, best of 3: 3.51 s per loop

The numbers are very similar, so there is no real gain with PyTables here. But we are pretty close to the maximum writing and reading rate of my SSD.

Writing:

Maximum write speed: 241.6 MB/s
PyTables write speed: 183.4 MB/s

Reading:

Maximum read speed: 250.2 MB/s
PyTables read speed: 217.4 MB/s
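
As a rough cross-check (my own arithmetic, not part of the measurements above): the effective rate is just the array size divided by the wall time, which lands in the same ballpark as the reported figures.

# 1e3 x 1e5 float64 values -> 1e8 * 8 bytes = 800 MB
size_mb = int(1e3) * int(1e5) * 8 / 1e6
print('write: %.1f MB/s' % (size_mb / 4.16))  # PyTables write time measured above
print('read:  %.1f MB/s' % (size_mb / 3.51))  # PyTables read time measured above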

Compression does not really help due to the randomness of the data:

%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
    h5_file.create_carray('/', 'data', obj=data)

1 loops, best of 3: 4.08 s per loop

Reading of the compressed data becomes a bit slower:

%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    data2 = h5_file.root.data.read()

1 loops, best of 3: 4.01 s per loop

This is different for regular data:

 reg_data = np.ones((int(1e3), int(1e5)))

Writing is significantly faster:

%%timeit
FILTERS = tables.Filters(complib='blosc', complevel=5)
with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
    h5_file.create_carray('/', 'reg_data', obj=reg_data)

1 loops, best of 3: 849 ms per loop

The same holds true for reading:

%%timeit
with tables.open_file('array.tbl', 'r') as h5_file:
    reg_data2 = h5_file.root.reg_data.read()

1 loops, best of 3: 1.7 s per loop

Conclusion: the more regular your data, the faster it should get using PyTables.
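
One way to see the effect directly, as a sketch reusing the arrays and filter settings from above: compare the on-disk size of the compressed file for random vs. regular data.

import os
import numpy as np
import tables

FILTERS = tables.Filters(complib='blosc', complevel=5)

for name, make in [('random', np.random.random), ('regular', np.ones)]:
    arr = make((int(1e3), int(1e5)))  # ~800 MB in memory
    with tables.open_file('array.tbl', mode='w', filters=FILTERS) as h5_file:
        h5_file.create_carray('/', 'data', obj=arr)
    print(name, os.path.getsize('array.tbl') / 1e6, 'MB on disk vs',
          arr.nbytes / 1e6, 'MB in memory')

The incompressible random data should stay close to its in-memory size, while the all-ones array should shrink dramatically, so far fewer bytes actually hit the disk.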


I've compared a few methods using perfplot (one of my projects). Here are the results:

Writing

[plot: write time vs. len(data) for npy, hdf5, pickle, pytables and zarr]

For large arrays, all methods are about equally fast. The file sizes are also equal, which is to be expected since the input arrays are random doubles and hence hardly compressible.

Code to reproduce the plot:

import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def npy_write(data):
    numpy.save("npy.npy", data)


def hdf5_write(data):
    f = h5py.File("hdf5.h5", "w")
    f.create_dataset("data", data=data)


def pickle_write(data):
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)


def pytables_write(data):
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()


def zarr_write(data):
    zarr.save("out.zarr", data)


perfplot.save(
    "write.png",
    setup=numpy.random.rand,
    kernels=[npy_write, hdf5_write, pickle_write, pytables_write, zarr_write],
    n_range=[2 ** k for k in range(28)],
    xlabel="len(data)",
    equality_check=None,
)

Reading

[plot: read time vs. len(data) for npy, hdf5, pickle, pytables and zarr]

For most sizes, pickle, PyTables, and hdf5 reads are roughly equally fast; for large arrays, pickle and zarr fall behind.

Code to reproduce the plot:

import perfplot
import pickle
import numpy
import h5py
import tables
import zarr


def setup(n):
    data = numpy.random.rand(n)
    # write all files
    #
    numpy.save("out.npy", data)
    #
    f = h5py.File("out.h5", "w")
    f.create_dataset("data", data=data)
    f.close()
    #
    with open("test.pkl", "wb") as f:
        pickle.dump(data, f)
    #
    f = tables.open_file("pytables.h5", mode="w")
    gcolumns = f.create_group(f.root, "columns", "data")
    f.create_array(gcolumns, "data", data, "data")
    f.close()
    #
    zarr.save("out.zip", data)


def npy_read(data):
    return numpy.load("out.npy")


def hdf5_read(data):
    f = h5py.File("out.h5", "r")
    out = f["data"][()]
    f.close()
    return out


def pickle_read(data):
    with open("test.pkl", "rb") as f:
        out = pickle.load(f)
    return out


def pytables_read(data):
    f = tables.open_file("pytables.h5", mode="r")
    out = f.root.columns.data[()]
    f.close()
    return out


def zarr_read(data):
    return zarr.load("out.zip")


perfplot.show(
    setup=setup,
    kernels=[
        npy_read,
        hdf5_read,
        pickle_read,
        pytables_read,
        zarr_read,
    ],
    n_range=[2 ** k for k in range(28)],
    xlabel="len(data)",
)