How to read a large file - line by line?
The correct, fully Pythonic way to read a file is the following:
    with open(...) as f:
        for line in f:
            # Do something with 'line'
The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
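As a minimal runnable instance of this pattern (the file name log.txt is just a placeholder, not from the original answer), here is a line count that never holds more than one line in memory:

    # Count lines without ever loading the whole file.
    # 'log.txt' is a placeholder path used only for this sketch.
    count = 0
    with open('log.txt') as f:
        for line in f:   # the file object yields one line at a time
            count += 1
    print(count)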
There should be one -- and preferably only one -- obvious way to do it.
Two memory-efficient ways in ranked order (first is best):

- use of with - supported from Python 2.5 and above
- use of yield if you really want to have control over how much to read
1. Use of with

with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with execution block; 2) exception handling inside the with block; 3) memory efficiency. The for loop iterates through the file object f line by line; internally it does buffered I/O (to optimize costly I/O operations) and memory management.
with open("x.txt") as f: for line in f: do something with data
2. Use of yield
    def readInChunks(fileObj, chunkSize=2048):
        """
        Lazy function to read a file piece by piece.
        Default chunk size: 2kB.
        """
        while True:
            data = fileObj.read(chunkSize)
            if not data:
                break
            yield data

    f = open('bigFile')
    for chunk in readInChunks(f):
        do_something(chunk)
    f.close()
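As a small follow-up sketch using the same readInChunks generator: wrapping the file in with guarantees it is closed even if do_something raises, which the bare open()/close() pair above does not.

    # Same lazy chunked reading, but the file closes even on error.
    with open('bigFile') as f:
        for chunk in readInChunks(f):
            do_something(chunk)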
Pitfalls, and for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read them to get a rounded understanding.
In Python, the most common way to read lines from a file is to do the following:
    for line in open('myfile', 'r').readlines():
        do_something(line)
When this is done, however, the readlines() function (the same applies to the read() function) loads the entire file into memory and then iterates over it. A slightly better approach for large files (though the two methods mentioned first above are still the best) is to use the fileinput module, as follows:
    import fileinput

    for line in fileinput.input(['myfile']):
        do_something(line)
The fileinput.input() call reads lines sequentially, but doesn't keep them in memory after they've been read; this works because file objects in Python are iterable.
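In Python 3.2 and later, fileinput.input() can also be used as a context manager, so the underlying file is closed deterministically; a brief sketch reusing the same placeholder file name:

    import fileinput

    # FileInput supports the context manager protocol in Python 3.2+.
    with fileinput.input(['myfile']) as f:
        for line in f:
            do_something(line)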
To strip newlines:
    with open(file_path, 'rU') as f:
        for line_terminated in f:
            line = line_terminated.rstrip('\n')
            ...
With universal newline support, all text file lines will seem to be terminated with '\n', whatever the terminators in the file: '\r', '\n', or '\r\n'.
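One detail worth a quick sketch (the sample string is hypothetical): rstrip('\n') removes only the newline, while a bare rstrip() would also strip other trailing whitespace such as tabs.

    line_terminated = 'some data\t\n'           # hypothetical line ending in a tab
    print(repr(line_terminated.rstrip('\n')))   # 'some data\t' - tab preserved
    print(repr(line_terminated.rstrip()))       # 'some data'   - tab stripped too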
EDIT - To specify universal newline support:
- Python 2 on Unix - open(file_path, mode='rU') - required [thanks @Dave]
- Python 2 on Windows - open(file_path, mode='rU') - optional
- Python 3 - open(file_path, newline=None) - optional
The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
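To see the translation in action in Python 3, one can write mixed terminators in binary and read them back with newline=None; a sketch with a throwaway file name:

    # 'mixed.txt' is a throwaway file used only for this demonstration.
    with open('mixed.txt', 'wb') as f:
        f.write(b'unix\nwindows\r\nmac\rend')

    with open('mixed.txt', newline=None) as f:  # newline=None is the default
        for line in f:
            print(repr(line))  # prints 'unix\n', 'windows\n', 'mac\n', 'end'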
To preserve native line terminators:
    with open(file_path, 'rb') as f:
        for line_native_terminated in f:
            ...
Binary mode can still parse the file into lines when you iterate with for ... in. Each line will have whatever terminators it has in the file.
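Reading the throwaway mixed.txt from the sketch above in binary mode shows this, with one caveat worth knowing: binary iteration splits only on b'\n', so a lone '\r' (classic-Mac style) does not end a line.

    with open('mixed.txt', 'rb') as f:
        for line in f:
            print(repr(line))
    # b'unix\n'
    # b'windows\r\n'
    # b'mac\rend'   <- a lone '\r' does not end a line in binary mode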