
Process very large (>20GB) text file line by line


It's more idiomatic to write your code like this:

def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))

The main saving here is doing the split only once; if the CPU is not being taxed, though, this is likely to make very little difference.

It may help to save up a few thousand lines at a time and write them in one hit, to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

As suggested by @Janne, an alternative way to generate the lines:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
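
To make the maxsplit behaviour concrete, here is a quick illustration with a made-up sample line (your real field widths may differ). The remainder of the line, trailing newline included, stays in one piece, so ' '.join(...) reproduces a complete output line:

line = "1.123456 2.123456 3.123456 some other columns here\n"
x, y, z, rest = line.split(' ', 3)   # at most 4 parts: 3 fields plus the remainder
print((x, y, z, rest))
# ('1.123456', '2.123456', '3.123456', 'some other columns here\n')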


Measure! You've received quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

  • Remove any processing from your code. Just read and write the data and measure the speed (see the sketch after this list). If just reading and writing the files is too slow, it's not a problem with your code.
  • If reading and writing alone is already slow, try using multiple disks. You are reading and writing at the same time, possibly on the same disk; if so, try separate disks and measure again.
  • An asynchronous I/O library (Twisted, for example) might help too.
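
As a minimal sketch of that first baseline step, assuming the same hypothetical "filepath"/"outfilepath" names used in the other answers, you could time a plain copy with no processing at all:

import time

def baseline_copy(inpath="filepath", outpath="outfilepath"):
    # Read and write only -- no splitting, no replacing -- to measure raw I/O speed.
    start = time.time()
    with open(inpath, "r") as r, open(outpath, "w") as w:
        for line in r:
            w.write(line)
    print("baseline copy took %.1f seconds" % (time.time() - start))

baseline_copy()

If this baseline is already close to the runtime of your full script, the bottleneck is the disk rather than the Python code.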

Once you have figured out the exact problem, ask again for optimizations of that problem.


Since you don't seem to be limited by CPU, but rather by I/O, have you tried varying the third parameter of open?

Indeed, this third parameter can be used to set the buffer size used for file operations.

Simply writing open("filepath", "r", 16777216) will use 16 MB buffers when reading from the file, and it should help.

Use the same setting for the output file, and measure/compare while keeping everything else identical.

Note: this is the same kind of optimization suggested by others, but you get it here for free, without changing your code and without having to buffer manually.
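
For instance, a minimal sketch (with the same hypothetical file names as above) that passes a 16 MB buffer size for both the input and the output file might look like this:

BUFSIZE = 16 * 1024 * 1024   # 16 MB, i.e. the 16777216 mentioned above

# The third positional argument of open() sets the buffer size in bytes.
with open("filepath", "r", BUFSIZE) as r, open("outfilepath", "w", BUFSIZE) as w:
    for line in r:
        x, y, z = line.split(' ')[:3]
        w.write(line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3]))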