save numpy array in append mode
The built-in .npy file format is perfectly fine for working with small datasets, without relying on external modules other than numpy.
However, once you have large amounts of data, a file format designed to handle such datasets, such as HDF5, is preferable [1].
For instance, below is a solution to save numpy arrays in HDF5 with PyTables.
Step 1: Create an extendable EArray storage

    import tables
    import numpy as np

    filename = 'outarray.h5'
    ROW_SIZE = 100
    NUM_COLUMNS = 200

    f = tables.open_file(filename, mode='w')
    atom = tables.Float64Atom()
    array_c = f.create_earray(f.root, 'data', atom, (0, ROW_SIZE))

    for idx in range(NUM_COLUMNS):
        x = np.random.rand(1, ROW_SIZE)
        array_c.append(x)
    f.close()
Step 2: Append rows to an existing dataset (if needed)

    f = tables.open_file(filename, mode='a')
    f.root.data.append(x)
    f.close()  # close the file before reopening it in another mode
Step 3: Read back a subset of the data

    f = tables.open_file(filename, mode='r')
    print(f.root.data[1:10,2:20])  # e.g. read from disk only this part of the dataset
    f.close()
This is an expansion on Mohit Pandey's answer showing a full save / load example. It was tested using Python 3.6 and Numpy 1.11.3.

    from pathlib import Path
    import numpy as np
    import os

    p = Path('temp.npy')
    with p.open('ab') as f:
        np.save(f, np.zeros(2))
        np.save(f, np.ones(2))

    with p.open('rb') as f:
        fsz = os.fstat(f.fileno()).st_size
        out = np.load(f)
        while f.tell() < fsz:
            out = np.vstack((out, np.load(f)))
out = array([[ 0., 0.], [ 1., 1.]])
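The save / load pattern above can be wrapped in two small helpers. This is only a minimal sketch: the names `append_npy` and `load_all_npy` are hypothetical and not part of NumPy. Keep in mind that a file written this way is a concatenation of several .npy records, so a plain `np.load(path)` returns only the first one.

```python
import os
import tempfile

import numpy as np


def append_npy(path, arr):
    """Append one array to `path` as an additional .npy record (hypothetical helper)."""
    with open(path, "ab") as f:
        np.save(f, arr)


def load_all_npy(path):
    """Read every record stored in `path` and stack them along axis 0 (hypothetical helper)."""
    chunks = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        # Each np.load call consumes exactly one record and advances the file position.
        while f.tell() < size:
            chunks.append(np.load(f))
    return np.vstack(chunks)


# Usage: write into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "temp.npy")
    append_npy(path, np.zeros(2))
    append_npy(path, np.ones(2))
    stacked = load_all_npy(path)

print(stacked.shape)
```

Collecting the records in a list and calling `np.vstack` once avoids copying the growing result on every iteration, which matters when many records have been appended.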
I made a library to create Numpy .npy files that are larger than the main memory of the machine by appending on the zero axis. The file can then be read with mmap_mode="r".
https://pypi.org/project/npy-append-array
Installation:
pip install npy-append-array
Example:

    from npy_append_array import NpyAppendArray
    import numpy as np

    arr1 = np.array([[1,2],[3,4]])
    arr2 = np.array([[1,2],[3,4],[5,6]])

    filename='out.npy'

    # optional, .append will create file automatically if not exists
    np.save(filename, arr1)

    npaa = NpyAppendArray(filename)
    npaa.append(arr2)
    npaa.append(arr2)
    npaa.append(arr2)

    data = np.load(filename, mmap_mode="r")

    print(data)
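For reference, reading with mmap_mode="r" needs only NumPy itself: `np.load` then returns a `np.memmap`, and slices are read from disk on demand instead of loading the whole file. A minimal sketch, assuming a small array and a temporary file chosen purely for illustration:

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.npy")  # illustrative file name
    np.save(path, np.arange(12).reshape(6, 2))

    # Memory-map the file read-only; no array data is loaded yet.
    data = np.load(path, mmap_mode="r")
    is_memmap = isinstance(data, np.memmap)

    # Slicing touches only the requested rows; np.array() copies them into memory.
    first_rows = np.array(data[:2])

    # Drop the memory map before the temporary directory is removed.
    del data

print(is_memmap)
print(first_rows)
```

Copying the slice with `np.array` keeps the values usable after the memory map has been released.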