Lazy Method for Reading Big File in Python?
To write a lazy function, just use yield:
    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data


    with open('really_big_file.dat') as f:
        for piece in read_in_chunks(f):
            process_data(piece)
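As a rough illustration of why the generator is handy, here is a sketch that hashes a stream chunk by chunk, so only one chunk is in memory at a time. The helper name `sha256_of_stream`, the 64 KiB chunk size, and the in-memory `io.BytesIO` stand-in for a real file are all assumptions made for the example.

```python
import hashlib
import io


def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive chunks from file_object until it is exhausted."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def sha256_of_stream(file_object, chunk_size=64 * 1024):
    """Hypothetical helper: hash a binary stream without loading it whole."""
    digest = hashlib.sha256()
    for piece in read_in_chunks(file_object, chunk_size):
        digest.update(piece)
    return digest.hexdigest()


# In-memory stand-in for a large file opened with open(path, "rb").
stream = io.BytesIO(b"x" * 200_000)
chunked_digest = sha256_of_stream(stream)
```

With a real file you would pass the object returned by `open(path, "rb")` instead of the `BytesIO` stand-in.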
Another option would be to use iter and a helper function:
    f = open('really_big_file.dat')

    def read1k():
        return f.read(1024)

    for piece in iter(read1k, ''):
        process_data(piece)
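A variation on the iter approach, sketched under the assumption that the file is opened in binary mode: `functools.partial` replaces the hand-written helper, and the sentinel becomes `b""` because a binary `read()` returns an empty bytes object at end of file. The name `iter_chunks` and the `io.BytesIO` stand-in are made up for the example.

```python
import io
from functools import partial


def iter_chunks(file_object, chunk_size=1024):
    """Hypothetical helper: return an iterator over chunks of a binary file object."""
    # iter(callable, sentinel) keeps calling the callable until it returns
    # the sentinel. A binary-mode read() returns b"" at end of file.
    return iter(partial(file_object.read, chunk_size), b"")


# In-memory stand-in for open('really_big_file.dat', 'rb').
stream = io.BytesIO(b"a" * 2500)
sizes = [len(piece) for piece in iter_chunks(stream)]
```

If the file is opened in text mode instead, the sentinel has to be `''`, otherwise the loop never terminates.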
If the file is line-based, the file object is already a lazy generator of lines:
    for line in open('really_big_file.dat'):
        process_data(line)
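The same line-by-line iteration can be wrapped in a with block so the file is closed deterministically. This is only a sketch: `count_matching_lines`, the UTF-8 encoding, and the temporary file contents are assumptions for illustration.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Hypothetical helper: count lines containing needle, one line at a time."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object hands back one line per iteration
            if needle in line:
                count += 1
    return count


# Small temporary file standing in for a large line-based file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".log", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("ok\nerror: disk\nok\nerror: net\n")

matches = count_matching_lines(tmp.name, "error")
os.remove(tmp.name)
```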
If your computer, OS, and Python are all 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:
    import mmap

    # Assumes hello.txt already contains b"Hello Python!\n"
    with open("hello.txt", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        print(mm.readline())  # prints b"Hello Python!\n"
        # read content via slice notation
        print(mm[:5])  # prints b"Hello"
        # update content using slice notation;
        # note that new content must have same size
        mm[6:] = b" world!\n"
        # ... and read again using standard file methods
        mm.seek(0)
        print(mm.readline())  # prints b"Hello  world!\n"
        # close the map
        mm.close()
If any of your computer, OS, or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
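For read-only access, a mapping can also be used to search a file without reading it into a Python bytes object first. The sketch below assumes a non-empty file (mapping an empty file with length 0 raises an error); `find_in_file` and the temporary file contents are made up for the example.

```python
import mmap
import os
import tempfile


def find_in_file(path, needle):
    """Hypothetical helper: return the byte offset of needle in the file, or -1."""
    with open(path, "rb") as f:
        # ACCESS_READ maps the file read-only; length 0 maps the whole file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)


# Temporary file standing in for a large binary file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"header\n" + b"\0" * 10_000 + b"MARKER")

marker_offset = find_in_file(tmp.name, b"MARKER")
missing_offset = find_in_file(tmp.name, b"absent")
os.remove(tmp.name)
```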
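file.readlines() takes an optional size hint argument. It keeps reading whole lines until their total size (in bytes or characters) reaches roughly that hint, so each call returns a bounded batch of lines rather than the entire file: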
    BUF_SIZE = 64 * 1024  # assumed size hint per batch; tune as needed

    with open('bigfilename', 'r') as bigfile:
        tmp_lines = bigfile.readlines(BUF_SIZE)
        while tmp_lines:
            process(tmp_lines)
            tmp_lines = bigfile.readlines(BUF_SIZE)
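The readlines loop can also be packaged as a generator, which keeps the batching logic out of the processing code. A minimal sketch: `read_line_batches`, the default hint of 64 KiB, and the `io.StringIO` stand-in for a real file are assumptions, not part of the answer above.

```python
import io


def read_line_batches(file_object, size_hint=64 * 1024):
    """Hypothetical helper: yield lists of whole lines, each list roughly size_hint in size."""
    while True:
        lines = file_object.readlines(size_hint)
        if not lines:
            break
        yield lines


# In-memory stand-in for a large text file.
text = "".join(f"line {i}\n" for i in range(1000))
batches = list(read_line_batches(io.StringIO(text), size_hint=100))
```

Every batch contains only complete lines, so per-line processing never has to deal with a line split across two batches.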