
numpy: efficiently reading a large array


NumPy provides fromfile() to read binary data.

a = numpy.fromfile("filename", dtype=numpy.float32)

will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m matrix, you can reshape it:

a = a.reshape((n, m), order="F")

[EDIT: The reshape() actually copies the data in this case (see the comments). To do it without copying, use

a = a.reshape((m, n)).T

Thanks to Joe Kington for pointing this out.]
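To make the two reshape variants concrete, here is a small sketch that round-trips a tiny matrix through a raw binary file. The file name, the 3 x 4 shape and the temporary directory are made up for illustration; substitute your own file and dimensions. Whether the order="F" reshape copies may depend on your NumPy version, so the sketch only prints what np.shares_memory() reports for it rather than assuming an answer.

```python
import os
import tempfile

import numpy as np

n, m = 3, 4  # hypothetical shape; use your real dimensions

# Stand-in for the real data file: raw float32 values in column-major order.
# tofile() always writes in C order, so writing the (contiguous) transpose
# produces the column-major byte layout of `expected`.
expected = np.arange(n * m, dtype=np.float32).reshape(n, m)
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
np.ascontiguousarray(expected.T).tofile(path)

# Read the raw values back as a flat, one-dimensional array.
flat = np.fromfile(path, dtype=np.float32)

# Variant 1: ask reshape() for Fortran order directly.
via_reshape = flat.reshape((n, m), order="F")

# Variant 2: reshape to (m, n) in C order, then transpose (a view of `flat`).
via_transpose = flat.reshape((m, n)).T

print(np.array_equal(via_reshape, expected))
print(np.array_equal(via_transpose, expected))
print("transpose variant shares memory:", np.shares_memory(flat, via_transpose))
print("order='F' variant shares memory:", np.shares_memory(flat, via_reshape))
```

Both variants give the same n x m matrix; the np.shares_memory() lines are a quick way to check on your own installation which of them avoids a copy.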

But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Both tools have FAQ entries comparing themselves to the other one. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of the two projects are slightly different).

HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I actually don't know which language you want to use to write the data...


Basically Numpy stores the arrays as flat vectors. The multiple dimensions are just an illusion created by different views and strides that the Numpy iterator uses.
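A minimal sketch of what "flat vector plus strides" means in practice: the position of element [i, j] in the underlying buffer is just i * strides[0] + j * strides[1] bytes from the start. The dtype is fixed to int64 here so the numbers do not depend on the platform, and the index pair is arbitrary.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C order, 8-byte items
flat = a.ravel()  # the same 12 values as a one-dimensional vector

i, j = 1, 2  # arbitrary element to look up

# Byte offset of a[i, j] inside the buffer, computed from the strides alone.
byte_offset = i * a.strides[0] + j * a.strides[1]

# Convert the byte offset into an index into the flat vector.
flat_index = byte_offset // a.itemsize

print(a.strides)         # (32, 8)
print(flat_index)        # 6
print(flat[flat_index])  # 6
print(a[i, j])           # 6
```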

For a thorough but easy-to-follow explanation of how Numpy works internally, see the excellent chapter 19 of the book Beautiful Code.

At least Numpy array() and reshape() have an order argument for C ('C'), Fortran ('F') or preserved order ('A'). Also see the question How to force numpy array order to fortran style?
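As a quick illustration of the order argument, the sketch below builds the same small matrix in C and Fortran order and inspects the memory layout flags. The sample data is made up, and the dtype is fixed to int64 so the strides are the same on every platform.

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]  # made-up sample values

c_arr = np.array(data, dtype=np.int64, order="C")  # row-major layout
f_arr = np.array(data, dtype=np.int64, order="F")  # column-major layout

# 'A' preserves Fortran order when the input is Fortran-contiguous
# (and not also C-contiguous); otherwise it falls back to C order.
a_arr = np.array(f_arr, order="A")

# The logical contents are identical; only the memory layout differs.
print(np.array_equal(c_arr, f_arr))  # True
print(c_arr.strides)                 # (24, 8)
print(f_arr.strides)                 # (8, 16)
print(a_arr.flags["F_CONTIGUOUS"])   # True
```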

An example with the default C indexing (row-major order):

>>> import numpy as np
>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])
>>> a.strides
(32, 8)

Indexing using Fortran order (column-major order):

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> a[1]
array([ 1,  4,  7, 10])
>>> a.strides
(8, 24)

The other view

Also, you can always get the other kind of view using the T attribute of an array:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])
>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
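To show that T really is just another view and not a copy, here is a small sketch: the transpose shares memory with the original, its strides are simply reversed, and writing through it changes the original array. The value 99 is an arbitrary marker.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C order
t = a.T  # transposed view, no data is moved

print(np.shares_memory(a, t))   # True
print(a.strides, t.strides)     # (32, 8) (8, 32)
print(t.flags["F_CONTIGUOUS"])  # True

# Writing through the view changes the original array as well.
t[0, 1] = 99
print(a[1, 0])                  # 99
```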

You can also manually set the strides:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
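The NumPy documentation discourages assigning to a.strides directly and points to numpy.lib.stride_tricks.as_strided() instead, which builds a new view and leaves the original array untouched. A sketch of the same reinterpretation done that way is shown below. The strides are derived from the item size rather than hard-coded, and writeable=False is my own precaution, since wrong strides can easily reach memory outside the buffer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12, dtype=np.int64).reshape(3, 4)  # C order, strides (32, 8)

# Reinterpret the same 12 values as a 3 x 4 column-major matrix:
# step one item down a column, three items across to the next column.
item = a.itemsize
b = as_strided(a, shape=(3, 4), strides=(item, 3 * item), writeable=False)

print(b)
# [[ 0  3  6  9]
#  [ 1  4  7 10]
#  [ 2  5  8 11]]
print(a.strides)  # (32, 8), the original array is unchanged
```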