Lazy Method for Reading Big File in Python?


To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data


    with open('really_big_file.dat') as f:
        for piece in read_in_chunks(f):
            process_data(piece)
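As one concrete stand-in for process_data, here is a minimal sketch that hashes a file chunk by chunk, so only one chunk is in memory at a time. The helper name sha256_of_file, the 64 KiB chunk size, and the temporary demo file are assumptions made for illustration and are not part of the original answer.

```python
import hashlib
import os
import tempfile


def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive chunks until read() returns an empty result."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def sha256_of_file(path, chunk_size=64 * 1024):
    """Hypothetical helper: hash a file one chunk at a time."""
    digest = hashlib.sha256()
    # Binary mode, so each chunk is a bytes object that hashlib accepts.
    with open(path, 'rb') as f:
        for piece in read_in_chunks(f, chunk_size):
            digest.update(piece)
    return digest.hexdigest()


# Demo on a small temporary file (stands in for a really big one).
payload = b'x' * 300_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

streamed = sha256_of_file(demo_path)
whole = hashlib.sha256(payload).hexdigest()
os.remove(demo_path)
```

Both digests should match, which shows the chunked read sees exactly the same bytes as reading the file in one go.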

Another option would be to use iter and a helper function:

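    f = open('really_big_file.dat')

    def read1k():
        return f.read(1024)

    # iter() keeps calling read1k() until it returns the sentinel '';
    # for a file opened in binary mode the sentinel must be b'' instead.
    for piece in iter(read1k, ''):
        process_data(piece)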
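The helper function can also be replaced by functools.partial. The following is a sketch rather than a definitive version: the wrapper name iter_chunks and the temporary demo file are assumptions, and the file is opened in binary mode, which is why the sentinel is b''.

```python
import os
import tempfile
from functools import partial


def iter_chunks(path, chunk_size=1024):
    """Hypothetical wrapper: yield chunks of a file read in binary mode."""
    with open(path, 'rb') as f:
        # partial(f.read, chunk_size) is a zero-argument callable;
        # iter() stops as soon as it returns the sentinel b''.
        for piece in iter(partial(f.read, chunk_size), b''):
            yield piece


# Demo: a 2500-byte file yields two full chunks and one partial chunk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'a' * 2500)
    demo_path = tmp.name

sizes = [len(piece) for piece in iter_chunks(demo_path)]
os.remove(demo_path)
print(sizes)  # [1024, 1024, 452]
```

Wrapping the loop in a generator keeps the open file inside a with block, so it is closed once the chunks are exhausted.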

If the file is line-based, the file object is already a lazy generator of lines:

    # The with block closes the file once the loop is finished.
    with open('really_big_file.dat') as f:
        for line in f:
            process_data(line)
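Because the file object hands out one line at a time, it can feed a generator expression directly. A small sketch, assuming a UTF-8 text file; the function name count_matching_lines and the demo content are made up for illustration.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Hypothetical helper: count lines containing needle, one line at a time."""
    with open(path, encoding='utf-8') as f:
        # The generator expression pulls lines lazily from the file object.
        return sum(1 for line in f if needle in line)


# Demo on a tiny temporary file.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', delete=False) as tmp:
    tmp.write('error: a\nok\nerror: b\n')
    demo_path = tmp.name

matches = count_matching_lines(demo_path, 'error')
os.remove(demo_path)
print(matches)  # 2
```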


If your computer, OS and Python are all 64-bit, you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:

    import mmap

    # assumes hello.txt already contains b"Hello Python!\n"
    with open("hello.txt", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        print(mm.readline())  # prints b"Hello Python!\n"
        # read content via slice notation
        print(mm[:5])  # prints b"Hello"
        # update content using slice notation;
        # note that new content must have same size
        mm[6:] = b" world!\n"
        # ... and read again using standard file methods
        mm.seek(0)
        print(mm.readline())  # prints b"Hello  world!\n"
        # close the map
        mm.close()

If your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
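If you only need to read, a read-only mapping is a bit safer. Below is a minimal sketch that searches a file for a byte string without reading the whole file into a Python object first. The function name find_in_file and the demo data are assumptions; note also that mmap raises ValueError when asked to map an empty file.

```python
import mmap
import os
import tempfile


def find_in_file(path, needle):
    """Hypothetical helper: byte offset of the first match, or -1 if absent."""
    with open(path, 'rb') as f:
        # ACCESS_READ gives a read-only map; length 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)


# Demo on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'a' * 10_000 + b'needle' + b'b' * 10)
    demo_path = tmp.name

offset = find_in_file(demo_path, b'needle')
missing = find_in_file(demo_path, b'haystack')
os.remove(demo_path)
print(offset, missing)  # 10000 -1
```

The operating system pages data in as the search touches it, so the lookup works on files larger than available RAM as long as the address space is big enough.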


file.readlines() takes an optional size hint. It is not a line count: it is an approximate cap on the total size (in bytes or characters) of the lines returned, and whole lines are read until that total is reached.

    BUF_SIZE = 64 * 1024  # illustrative value (assumption); tune as needed

    with open('bigfilename', 'r') as bigfile:
        tmp_lines = bigfile.readlines(BUF_SIZE)
        while tmp_lines:
            process(tmp_lines)
            tmp_lines = bigfile.readlines(BUF_SIZE)
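The same loop can be wrapped in a generator so that callers just iterate over batches. This is a sketch under assumptions: the name read_line_batches, the default hint, and the demo file are all made up, and the demo uses a deliberately tiny hint to force several batches.

```python
import os
import tempfile


def read_line_batches(path, size_hint=64 * 1024):
    """Hypothetical helper: yield lists of lines, each roughly size_hint big."""
    with open(path, 'r') as f:
        while True:
            batch = f.readlines(size_hint)
            if not batch:  # an empty list means end of file
                break
            yield batch


# Demo: 1000 short lines with a tiny hint produce many small batches.
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    for i in range(1000):
        tmp.write(f'line {i}\n')
    demo_path = tmp.name

batches = list(read_line_batches(demo_path, size_hint=100))
total_lines = sum(len(batch) for batch in batches)
os.remove(demo_path)
```

Every batch contains only complete lines, so no line is ever split across two batches.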