
Creating very large NUMPY arrays in small chunks (PyTables vs. numpy.memmap)


That's weird. np.memmap should work. I've been using it with 250 GB of data on a 12 GB RAM machine without problems.

Does the system really run out of memory at the very moment the memmap file is created, or does it happen later in the code? If it happens at file creation, I really don't know what the problem would be.

When I started using memmap I made some mistakes that led me to run out of memory. For me, something like the code below should work:

import numpy as np

mmapData = np.memmap(mmapFile, mode='w+',
                     shape=(smallarray_size, number_of_arrays),
                     dtype='float64')

for k in range(number_of_arrays):
    smallarray = np.fromfile(list_of_files[k])  # list_of_files holds the file names
    smallarray = do_something_with_array(smallarray)
    mmapData[:, k] = smallarray

It may not be the most efficient way, but it seems to me that it would have the lowest memory usage.

PS: Be aware that the default dtypes of memmap (uint8) and fromfile (float64) are different!
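
For example (a minimal sketch with made-up file names), passing the dtype explicitly to both calls avoids that mismatch:

import numpy as np

# Hypothetical file names; the point is only the explicit dtype on both calls.
chunk = np.fromfile('chunk_000.bin', dtype='float64')   # fromfile defaults to float64
mm = np.memmap('big.dat', mode='w+', dtype='float64',   # memmap defaults to uint8
               shape=chunk.shape)
mm[:] = chunk
mm.flush()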


HDF5 is a C library that can efficiently store large on-disk arrays. Both PyTables and h5py are Python libraries on top of HDF5. If you're using tabular data then PyTables might be preferred; if you have just plain arrays then h5py is probably more stable/simpler.
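
As a rough sketch of the h5py route (the file name, dataset name, and chunk sizes below are made up): create one resizable dataset and append each processed chunk to it, so only one chunk needs to be in RAM at a time.

import numpy as np
import h5py

# Hypothetical sizes for illustration only.
n_chunks, chunk_len = 100, 1_000_000

with h5py.File('big_array.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(0,), maxshape=(None,),
                            dtype='float64', chunks=(chunk_len,))
    for k in range(n_chunks):
        chunk = np.random.rand(chunk_len)              # stand-in for your processed chunk
        dset.resize(dset.shape[0] + chunk_len, axis=0)  # grow the dataset on disk
        dset[-chunk_len:] = chunk                       # only this chunk is in RAM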

There are also out-of-core numpy array solutions that handle the chunking for you. Dask.array would give you plain numpy semantics on top of your collection of chunked files (see the docs on stacking).
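
A rough sketch of what that could look like (the file names and chunk length are made up, and I'm assuming each file holds a 1-D float64 chunk):

import numpy as np
import dask
import dask.array as da

# Hypothetical file list and chunk length, for illustration only.
list_of_files = ['chunk_%03d.bin' % k for k in range(100)]
chunk_len = 1_000_000

# Build one lazy dask array per file, then stack them into a 2-D array.
lazy_chunks = [
    da.from_delayed(dask.delayed(np.fromfile)(fname, dtype='float64'),
                    shape=(chunk_len,), dtype='float64')
    for fname in list_of_files
]
big = da.stack(lazy_chunks, axis=1)   # shape (chunk_len, number_of_files)

# Nothing is loaded yet; computations stream through the chunks.
print(big.mean().compute())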