How to input a list of lists with different sizes in tf.data.Dataset


You can use tf.data.Dataset.from_generator() to convert any iterable Python object (like a list of lists) into a Dataset:

import tensorflow as tf  # TensorFlow 1.x API (Session, one-shot iterator)

t = [[4, 2], [3, 4, 5]]

# Each element is a 1-D int32 tensor of unknown length.
dataset = tf.data.Dataset.from_generator(lambda: t, tf.int32, output_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # ==> '[4, 2]'
    print(sess.run(next_element))  # ==> '[3, 4, 5]'
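
The snippet above uses the TensorFlow 1.x session API. A rough TensorFlow 2 equivalent might look like the sketch below. It assumes TensorFlow 2.4 or newer, where from_generator() accepts an output_signature argument, and the name rows is only illustrative.

```python
import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

# Assumption: TensorFlow 2.4+, where output_signature is available.
# shape=(None,) declares a 1-D tensor whose length may vary per element.
dataset = tf.data.Dataset.from_generator(
    lambda: t,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# In TF 2 the dataset is iterated eagerly; no Session or iterator is needed.
rows = [x.numpy().tolist() for x in dataset]
print(rows)  # ==> [[4, 2], [3, 4, 5]]
```

Because every element is produced by calling back into Python, this stays flexible but is generally slower than building the dataset from in-memory tensors.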


For those working with TensorFlow 2 and looking for an answer: I found that the following works directly with ragged tensors, which should be much faster than a generator as long as the entire dataset fits in memory.

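import tensorflow as tf

# Each element is wrapped in an extra list so that it has 2 dimensions.
t = [[[4, 2]],
     [[3, 4, 5]]]

rt = tf.ragged.constant(t)
dataset = tf.data.Dataset.from_tensor_slices(rt)

for x in dataset:
    print(x)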

produces

<tf.RaggedTensor [[4, 2]]>
<tf.RaggedTensor [[3, 4, 5]]>

For some reason, it's very particular about each individual array having at least 2 dimensions, which is why every list in t is wrapped in an extra pair of brackets.
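
If the extra nesting is inconvenient, one possible workaround is to convert the ragged tensor to a padded dense tensor before slicing it. This is a different technique from slicing the ragged tensor directly: it trades the variable-length rows for zero padding. A minimal sketch, assuming TensorFlow 2 and using RaggedTensor.to_tensor() (the names dense and sliced are only illustrative):

```python
import tensorflow as tf

# No extra nesting here: a plain list of variable-length lists.
rt = tf.ragged.constant([[4, 2], [3, 4, 5]])

# to_tensor() pads the shorter rows with default_value up to the longest row.
dense = rt.to_tensor(default_value=0)

dataset = tf.data.Dataset.from_tensor_slices(dense)
sliced = [x.numpy().tolist() for x in dataset]
print(sliced)  # ==> [[4, 2, 0], [3, 4, 5]]
```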


I don't think TensorFlow supports regular (dense) tensors with varying numbers of elements along a given dimension.

However, a simple solution is to pad the nested lists with trailing zeros (where necessary):

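import tensorflow as tf

t = [[4, 2], [3, 4, 5]]

# Pad every list with trailing zeros up to the length of the longest one.
max_length = max(len(lst) for lst in t)
t_pad = [lst + [0] * (max_length - len(lst)) for lst in t]
print(t_pad)

dataset = tf.data.Dataset.from_tensor_slices(t_pad)
print(dataset)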

Outputs:

[[4, 2, 0], [3, 4, 5]]
<TensorSliceDataset shapes: (3,), types: tf.int32>

The zeros shouldn't be a big problem for the model: semantically they're just extra sentences of size zero at the end of each list of actual sentences.
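
If 0 can also appear as a real value, the padding becomes ambiguous. One option is to keep the original lengths (or a 0/1 mask) next to the padded data so the model can tell the two apart. The sketch below is plain Python; pad_with_lengths, lengths, and mask are hypothetical names chosen for illustration, not part of any library.

```python
def pad_with_lengths(rows, pad_value=0):
    """Pad each row to the longest row's length; also return the original lengths."""
    lengths = [len(row) for row in rows]
    max_length = max(lengths, default=0)
    padded = [row + [pad_value] * (max_length - len(row)) for row in rows]
    return padded, lengths


t = [[4, 2], [3, 4, 5]]
t_pad, lengths = pad_with_lengths(t)

# 1 marks a real value, 0 marks padding.
width = max(lengths)
mask = [[1] * n + [0] * (width - n) for n in lengths]

print(t_pad)    # [[4, 2, 0], [3, 4, 5]]
print(lengths)  # [2, 3]
print(mask)     # [[1, 1, 0], [1, 1, 1]]
```

The padded lists and the lengths (or mask) can then be passed together, for example as a tuple, to tf.data.Dataset.from_tensor_slices().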