How to get line count of a large file cheaply in Python?
You can't get any better than that.
After all, any solution has to read the entire file, count the \n characters, and return the result.
Do you have a better way of doing that without reading the entire file? I'm not sure there is one. The best solution will always be I/O-bound; the most you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.
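To make the memory point concrete, here is a minimal sketch (Python 3, not taken from the thread) that reads the file in fixed-size binary chunks and counts newline bytes, so memory use stays bounded no matter how large the file is. The name count_newlines and the 1 MiB default chunk size are assumptions chosen for illustration.

```python
def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes in a file, reading chunk_size bytes at a time."""
    total = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF),
        # so at most one chunk is held in memory at any moment.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total
```

Note that this counts newline characters, so a final line with no trailing newline is not included in the total.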
I believe that a memory mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).
I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.
Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor
Here are my results:
    mapcount : 0.465599966049
    simplecount : 0.756399965286
    bufcount : 0.546800041199
    opcount : 0.718600034714
Edit: numbers for Python 2.6:
    mapcount : 0.471799945831
    simplecount : 0.634400033951
    bufcount : 0.468800067902
    opcount : 0.602999973297
So the buffer read strategy seems to be the fastest for Windows/Python 2.6, although its margin over the memory-mapped version is tiny (about 0.469 versus 0.472).
Here is the code:
    from __future__ import with_statement
    import time
    import mmap
    import random
    from collections import defaultdict

    def mapcount(filename):
        f = open(filename, "r+")
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

    def simplecount(filename):
        lines = 0
        for line in open(filename):
            lines += 1
        return lines

    def bufcount(filename):
        f = open(filename)
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines

    def opcount(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1

    counts = defaultdict(list)

    for i in range(5):
        for func in [mapcount, simplecount, bufcount, opcount]:
            start_time = time.time()
            assert func("big_file.txt") == 1209138
            counts[func].append(time.time() - start_time)

    for key, vals in counts.items():
        print key.__name__, ":", sum(vals) / float(len(vals))
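The benchmark code is Python 2 (print statement, text-mode count('\n')). For anyone trying the memory-mapped approach on Python 3, here is a rough sketch of how mapcount might be ported. It is not the code that was timed, and I have not benchmarked it. The name mapcount_py3 is made up for illustration, and it opens the file read-only with ACCESS_READ rather than "r+" as in the original.

```python
import mmap
import os

def mapcount_py3(path):
    """Count lines by calling readline() on a read-only memory map."""
    # mmap cannot map a zero-length file, so handle that case up front.
    if os.path.getsize(path) == 0:
        return 0
    lines = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ avoids needing write permission.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            # readline() returns b"" once the end of the map is reached.
            while buf.readline():
                lines += 1
    return lines
```

Like the original mapcount, this counts a final line even when it has no trailing newline, so it can differ by one from a pure count of \n characters.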