Feeding .npy (numpy files) into tensorflow data pipeline


It is possible to read NPY files directly with TensorFlow instead of converting them to TFRecords. The key pieces are tf.data.FixedLengthRecordDataset and tf.io.decode_raw, along with a look at the documentation of the NPY format. For simplicity, suppose you are given an NPY file containing a float32 array with shape (N, K), and that you know both the dtype and the number of features K beforehand. An NPY file is just a binary file with a small header followed by the raw array data (object arrays are different, but we are only considering numeric arrays here). You can find the size of this header with a function like this:

def npy_header_offset(npy_path):
    with open(str(npy_path), 'rb') as f:
        # Every NPY file starts with this 6-byte magic string.
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        version_major, version_minor = f.read(2)
        # The header length field takes 2 bytes in version 1 and 4 bytes in versions 2 and 3.
        if version_major == 1:
            header_len_size = 2
        elif version_major in (2, 3):
            header_len_size = 4
        else:
            raise ValueError('Unknown NPY file version {}.{}.'.format(version_major, version_minor))
        # The header length is stored as a little-endian unsigned integer.
        header_len = sum(b << (8 * i) for i, b in enumerate(f.read(header_len_size)))
        header = f.read(header_len)
        if not header.endswith(b'\n'):
            raise ValueError('Invalid NPY file.')
        # The raw array data starts right after the header.
        return f.tell()

With this you can create a dataset like this:

import tensorflow as tf

npy_file = 'my_file.npy'
num_features = ...
dtype = tf.float32
header_offset = npy_header_offset(npy_file)
dataset = tf.data.FixedLengthRecordDataset([npy_file], num_features * dtype.size, header_bytes=header_offset)

Each element of this dataset contains a long string of bytes representing a single example. You can now decode it to obtain an actual array:

dataset = dataset.map(lambda s: tf.io.decode_raw(s, dtype))

The elements will have indeterminate shape, though, because TensorFlow does not keep track of the length of the strings. You can just enforce the shape since you know the number of features:

dataset = dataset.map(lambda s: tf.reshape(tf.io.decode_raw(s, dtype), (num_features,)))

Similarly, you can choose to perform this step after batching, or combine it in whatever way you feel like.
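To make the record layout concrete, here is a pure-Python illustration of what the fixed-length records contain. It is not TensorFlow code, only a sketch of the same idea: skip the header, then read K float32 values at a time. The function name is made up for this example, and it assumes little-endian float32 data stored in C (row-major) order.

```python
import struct


def iter_fixed_length_records(path, num_features, header_bytes):
    """Yield one tuple of `num_features` float32 values per record.

    Hypothetical helper: mirrors the idea of reading fixed-length records
    after a header, assuming little-endian float32 data in row-major order.
    """
    record_bytes = num_features * 4          # float32 takes 4 bytes per value
    fmt = '<{}f'.format(num_features)        # little-endian float32 values
    with open(path, 'rb') as f:
        f.seek(header_bytes)                 # skip the header
        while True:
            record = f.read(record_bytes)
            if len(record) < record_bytes:   # end of file (or trailing partial record)
                break
            yield struct.unpack(fmt, record)
```

Each yielded tuple corresponds to one row of the (N, K) array, which is what one decoded and reshaped dataset element holds.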

The limitation is that you have to know the number of features in advance. It is possible to extract it from the NumPy header, but that is a bit of a pain and very hard to do from within TensorFlow, so the file names would need to be known in advance. Another limitation is that, as written, the solution requires you to use either a single file per dataset or files that all have the same header size, although that should be the case if you know that all the arrays have the same size.
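Outside of TensorFlow, reading the shape from the header is manageable, because the header is a Python dict literal. A minimal sketch using only the standard library follows; the function name read_npy_header is invented for this example, and it only covers format versions 1 to 3.

```python
import ast
import struct


def read_npy_header(npy_path):
    """Return (shape, descr, fortran_order, data_offset) for an NPY file.

    Hypothetical helper for illustration, not part of NumPy or TensorFlow.
    """
    with open(str(npy_path), 'rb') as f:
        if f.read(6) != b'\x93NUMPY':
            raise ValueError('Invalid NPY file.')
        major, _minor = f.read(2)
        if major == 1:
            (header_len,) = struct.unpack('<H', f.read(2))   # 2-byte length
        elif major in (2, 3):
            (header_len,) = struct.unpack('<I', f.read(4))   # 4-byte length
        else:
            raise ValueError('Unknown NPY file version {}.'.format(major))
        # Version 3 headers are UTF-8; earlier versions are effectively latin1.
        encoding = 'utf-8' if major >= 3 else 'latin1'
        text = f.read(header_len).decode(encoding).strip()
        header = ast.literal_eval(text)      # the header is a dict literal
        return header['shape'], header['descr'], header['fortran_order'], f.tell()
```

With this, num_features could be taken as shape[1], and the returned offset used as header_bytes. It is also worth checking that fortran_order is False, since the record-based reading assumes row-major data.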

Admittedly, if one considers this kind of approach it may just be better to have a pure binary file without headers, and either hard code the number of features or read them from a different source...
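For the headerless alternative, a sketch with NumPy might look like the following. The helper names are made up for this example, and the number of features still has to come from somewhere else, as noted above.

```python
import numpy as np


def save_raw(array, path):
    """Write `array` as raw float32 values in row-major order, with no header."""
    np.ascontiguousarray(array, dtype=np.float32).tofile(path)


def load_raw(path, num_features):
    """Read a headerless float32 file back into an array of shape (N, num_features)."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, num_features)
```

A file written this way has no header to skip, so the same FixedLengthRecordDataset approach should work with a record size of num_features * 4 bytes and no header_bytes argument.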


You can do it with tf.py_func, see the example here. The parse function would simply decode the filename from bytes to a string and call np.load.

Update: something like this:

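import numpy as np
import tensorflow as tf

def read_npy_file(item):
    # The file name arrives as bytes, so decode it before loading.
    data = np.load(item.decode())
    return data.astype(np.float32)

file_list = ['/foo/bar.npy', '/foo/baz.npy']
dataset = tf.data.Dataset.from_tensor_slices(file_list)
# tf.py_func is the TensorFlow 1.x name; in 2.x it is tf.compat.v1.py_func.
dataset = dataset.map(
    lambda item: tuple(tf.py_func(read_npy_file, [item], [tf.float32,])))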
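In TensorFlow 2.x, tf.numpy_function plays the role that tf.py_func had in 1.x. The following is a sketch of the same idea under that assumption; make_npy_dataset is a name invented for this example.

```python
import numpy as np
import tensorflow as tf


def read_npy_file(item):
    # tf.numpy_function hands a scalar string tensor over as a bytes object.
    data = np.load(item.decode())
    return data.astype(np.float32)


def make_npy_dataset(file_list):
    """Build a dataset with one element per NPY file (illustrative helper)."""
    dataset = tf.data.Dataset.from_tensor_slices(file_list)
    return dataset.map(
        lambda item: tf.numpy_function(read_npy_file, [item], tf.float32))
```

Note that TensorFlow cannot infer the shape of the loaded arrays here, so the elements have unknown static shape unless you set it yourself afterwards.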


Does your data fit into memory? If so, you can follow the instructions from the Consuming NumPy Arrays section of the docs:

Consuming NumPy arrays

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().

# Load the training data into two NumPy arrays, for example using `np.load()`.
with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
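One detail about the quoted snippet: using np.load as a context manager and indexing the result by key works for .npz archives (written with np.savez), whereas a plain .npy file loads as a single array. A small sketch, with helper names invented for this example and the key names taken from the snippet:

```python
import numpy as np


def save_training_data(path, features, labels):
    """Store both arrays in one .npz archive under the keys used above."""
    np.savez(path, features=features, labels=labels)


def load_training_data(path):
    """Load the two arrays back from the .npz archive."""
    with np.load(path) as data:
        return data['features'], data['labels']
```

The two returned arrays can then be passed to tf.data.Dataset.from_tensor_slices as a (features, labels) tuple, as in the quoted documentation.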

If the file doesn't fit into memory, it seems like the only recommended approach is to first convert the npy data into the TFRecord format and then use a TFRecord dataset, which can be streamed without being fully loaded into memory.

Here is a post with some instructions.
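As a rough idea of what such a conversion involves, here is a sketch assuming TensorFlow 2.x and a 2-D numeric array. The function names and the 'features' key are chosen for this example only. The array is memory-mapped with np.load(..., mmap_mode='r') so that rows are read from disk on demand rather than all at once.

```python
import numpy as np
import tensorflow as tf


def npy_to_tfrecord(npy_path, tfrecord_path):
    """Write each row of a 2-D NPY array as one tf.train.Example."""
    array = np.load(npy_path, mmap_mode='r')   # memory-mapped, not fully loaded
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in array:
            values = np.asarray(row, dtype=np.float32).ravel().tolist()
            example = tf.train.Example(features=tf.train.Features(feature={
                'features': tf.train.Feature(
                    float_list=tf.train.FloatList(value=values)),
            }))
            writer.write(example.SerializeToString())


def read_tfrecord(tfrecord_path, num_features):
    """Stream the rows back as float32 tensors of shape (num_features,)."""
    spec = {'features': tf.io.FixedLenFeature([num_features], tf.float32)}
    dataset = tf.data.TFRecordDataset([tfrecord_path])
    return dataset.map(
        lambda s: tf.io.parse_single_example(s, spec)['features'])
```

Writing one Example per row is simple but not the most compact encoding; for large datasets it may be worth grouping rows or storing raw bytes instead.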

FWIW, it seems crazy to me that a TFRecord dataset cannot be instantiated directly with a directory name or the file name(s) of npy files, but it appears to be a limitation of plain TensorFlow.

If you can split the single large npy file into smaller files that each roughly represent one batch for training, then you could write a custom data generator in Keras that would yield only the data needed for the current batch.
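A minimal sketch of that idea, written as a plain Python generator rather than a Keras class, and assuming each file already holds one batch (the function name is made up for this example):

```python
import numpy as np


def npy_batch_generator(file_list):
    """Yield one float32 array per file, loading only that file into memory."""
    for path in file_list:
        yield np.load(path).astype(np.float32)
```

Only one batch is held in memory at a time. A generator like this could be looped over for several epochs, or wrapped with tf.data.Dataset.from_generator if you want to stay within the tf.data pipeline.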

In general, if your dataset cannot fit in memory, storing it as one single large npy file makes it very hard to work with. Preferably, you should reformat the data first, either as TFRecord or as multiple npy files, and then use one of the other methods.