How to load one line at a time from a pickle file?


You can write pickles incrementally to a file, which allows you to load them incrementally as well.

Take the following example. Here, we iterate over the items of a list, and pickle each one in turn.

>>> import cPickle
>>> myData = [1, 2, 3]
>>> f = open('mydata.pkl', 'wb')
>>> pickler = cPickle.Pickler(f)
>>> for e in myData:
...     pickler.dump(e)
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
<cPickle.Pickler object at 0x7f3849818f68>
>>> f.close()

Now we can do the same process in reverse and load each object as needed. For the purpose of example, let's say that we just want the first item and don't want to iterate over the entire file.

>>> f = open('mydata.pkl', 'rb')
>>> unpickler = cPickle.Unpickler(f)
>>> unpickler.load()
1

At this point, the file stream has only advanced as far as the first object. The remaining objects weren't loaded, which is exactly the behavior you want. For proof, you can try reading the rest of the file and see the rest is still sitting there.

>>> f.read()
'I2\n.I3\n.'
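
If you do eventually want every remaining object, you can keep calling load() until it raises EOFError at the end of the file. Here is a minimal sketch, assuming the mydata.pkl file written above; the load_all helper name is just an illustration, not part of the pickle API (on Python 3, the same idea works with pickle.Unpickler):

import cPickle

def load_all(path):
    # Yield each pickled object in turn; Unpickler.load() raises EOFError
    # once the end of the file is reached.
    with open(path, 'rb') as f:
        unpickler = cPickle.Unpickler(f)
        while True:
            try:
                yield unpickler.load()
            except EOFError:
                break

# e.g. list(load_all('mydata.pkl')) -> [1, 2, 3]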


Since you do not know the internal workings of pickle, you need to use a different storage method. The script below uses the tobytes() function to save the data line by line in a raw file.

Since the length of each line is known, its offset in the file can be computed and the line can be accessed via seek() and read(). Afterwards, it is converted back to an array with the frombuffer() function.

The big disclaimer, however, is that the size of the array is not saved (this could be added as well, but requires some extra effort; one simple approach is sketched below) and that this method might not be as portable as a pickled array.
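
For completeness, one way to record the missing shape and dtype is a small sidecar file next to the raw data. This is only an illustrative sketch, not part of the original answer; the dump_header/load_header names are made up:

import json
import numpy

def dump_header(a, path):
    # Store shape and dtype in a small JSON file next to the raw array data.
    with open(path + '.json', 'w') as fd:
        json.dump({'shape': list(a.shape), 'dtype': a.dtype.str}, fd)

def load_header(path):
    # Read the metadata back so the reader can be constructed without
    # knowing the array layout in advance.
    with open(path + '.json') as fd:
        meta = json.load(fd)
    return tuple(meta['shape']), numpy.dtype(meta['dtype'])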

As @PadraicCunningham pointed out in his comment, a memmap is likely to be an alternative and elegant solution.
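
A minimal memmap sketch, assuming the raw file written by dumparray() below and that the shape and dtype are known up front (the 20000x40000 float64 figures and the /tmp/array path are just placeholders):

import numpy

rows, cols = 20000, 40000   # assumed dimensions
mm = numpy.memmap('/tmp/array', dtype=numpy.float64,
                  mode='r', shape=(rows, cols))
line = mm[123]              # only the pages backing this row are read from disk
print(line[:5])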

Remark on performance: After reading the comments I did a short benchmark. On my machine (16GB RAM, encrypted SSD) I was able to do 40000 random line reads in 24 seconds (with a 20000x40000 matrix of course, not the 10x10 from the example).

from __future__ import print_function
import numpy
import random


def dumparray(a, path):
    # Write each row's raw bytes to the file, one after another.
    lines, _ = a.shape
    with open(path, 'wb') as fd:
        for i in range(lines):
            fd.write(a[i,...].tobytes())


class RandomLineAccess(object):
    # Random access to single rows of the raw file via seek()/read().
    def __init__(self, path, cols, dtype):
        self.dtype = dtype
        self.fd = open(path, 'rb')
        self.line_length = cols*dtype.itemsize

    def read_line(self, line):
        # Rows have a fixed byte length, so the offset is line * line_length.
        offset = line*self.line_length
        self.fd.seek(offset)
        data = self.fd.read(self.line_length)
        return numpy.frombuffer(data, self.dtype)

    def close(self):
        self.fd.close()


def main():
    lines = 10
    cols = 10
    path = '/tmp/array'
    a = numpy.zeros((lines, cols))
    dtype = a.dtype
    for i in range(lines):
        # add some data to distinguish lines
        numpy.ndarray.fill(a[i,...], i)
    dumparray(a, path)
    rla = RandomLineAccess(path, cols, dtype)

    line_indices = list(range(lines))
    for _ in range(20):
        line_index = random.choice(line_indices)
        print(line_index, rla.read_line(line_index))


if __name__ == '__main__':
    main()


Thanks everyone. I ended up finding a workaround (a machine with more RAM so I could actually load the dataset into memory).