
Process very large (>20GB) text file line by line


It's more idiomatic to write your code like this:

def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))

The main saving here is doing the split only once; if the CPU is not being taxed, though, this is likely to make very little difference.

It may help to save up a few thousand lines at a time and write them in one hit, to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

As suggested by @Janne, an alternative way to generate the lines:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
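
To make the maxsplit behaviour concrete, here is a quick illustration with a made-up sample line (your real field widths may differ). The remainder of the line, trailing newline included, stays in one piece, so ' '.join(...) reproduces a complete output line:

line = "1.123456 2.123456 3.123456 some other columns here\n"
x, y, z, rest = line.split(' ', 3)   # at most 4 parts: 3 fields plus the remainder
print((x, y, z, rest))
# ('1.123456', '2.123456', '3.123456', 'some other columns here\n')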


Measure! You've received quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

  • Remove any processing from your code. Just read and write the data and measure the speed (see the sketch after this list). If just reading and writing the files is too slow, it's not a problem with your code.
  • If reading and writing alone is already slow, try using multiple disks. You are reading and writing at the same time, possibly on the same disk; if so, try separate disks and measure again.
  • An asynchronous I/O library (Twisted, for example) might help too.
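
As a minimal sketch of that first baseline step, assuming the same hypothetical "filepath"/"outfilepath" names used in the other answers, you could time a plain copy with no processing at all:

import time

def baseline_copy(inpath="filepath", outpath="outfilepath"):
    # Read and write only -- no splitting, no replacing -- to measure raw I/O speed.
    start = time.time()
    with open(inpath, "r") as r, open(outpath, "w") as w:
        for line in r:
            w.write(line)
    print("baseline copy took %.1f seconds" % (time.time() - start))

baseline_copy()

If this baseline is already close to the runtime of your full script, the bottleneck is the disk rather than the Python code.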

Once you have figured out the exact problem, ask again for optimizations of that problem.


Since you don't seem to be limited by CPU, but rather by I/O, have you tried varying the third parameter of open?

Indeed, this third parameter can be used to set the buffer size used for file operations.

Simply writing open("filepath", "r", 16777216) will use 16 MB buffers when reading from the file, and it should help.

Use the same setting for the output file, and measure/compare while keeping everything else identical.

Note: this is the same kind of optimization suggested by others, but you get it here for free, without changing your code and without having to buffer manually.
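
For instance, a minimal sketch (with the same hypothetical file names as above) that passes a 16 MB buffer size for both the input and the output file might look like this:

BUFSIZE = 16 * 1024 * 1024   # 16 MB, i.e. the 16777216 mentioned above

# The third positional argument of open() sets the buffer size in bytes.
with open("filepath", "r", BUFSIZE) as r, open("outfilepath", "w", BUFSIZE) as w:
    for line in r:
        x, y, z = line.split(' ')[:3]
        w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))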