
Why is "slurping" a file not a good practice?


Again and again we see questions about reading a text file to process it line-by-line that use variations of read or readlines, which pull the entire file into memory in one action.

The documentation for read says:

Opens the file, optionally seeks to the given offset, then returns length bytes (defaulting to the rest of the file). [...]

The documentation for readlines says:

Reads the entire file specified by name as individual lines, and returns those lines in an array. [...]

Pulling in a small file is no big deal, but there comes a point where memory has to be shuffled around as the incoming data's buffer grows, and that eats CPU time. In addition, if the data consumes too much memory, the OS has to get involved just to keep the script running and starts spooling to disk, which will bring the program to its knees. On an HTTPd (web host) or anything needing fast response it'll cripple the entire application.

Slurping is usually based on a misunderstanding of the speed of file I/O, or on thinking that it's better to read and then split the buffer than to read it a single line at a time.
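To make the contrast concrete, here's a minimal sketch of the two approaches (the sample file contents here are made up for illustration). Both count lines, but the first holds the whole file in one String while the second holds only one line at a time:

```ruby
require "tempfile"

# Create a small sample file so the example is self-contained.
file = Tempfile.new("sample")
file.write("line one\nline two\nline three\n")
file.close

# Slurping: the entire file lands in one String, which is then split.
slurped_count = File.read(file.path).lines.size

# Streaming: File.foreach yields one line at a time; memory use stays flat
# no matter how large the file is.
streamed_count = 0
File.foreach(file.path) { |_line| streamed_count += 1 }

puts slurped_count   # => 3
puts streamed_count  # => 3
```

Both produce the same answer; the difference only shows up in memory use and, as the timings below demonstrate, in speed as the file grows.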

Here's some test code to demonstrate the problem caused by "slurping".

Save this as "test.sh":

```sh
echo Building test files...

yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000       > kb.txt
yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000000    > mb.txt
yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000000000 > gb1.txt
cat gb1.txt gb1.txt > gb2.txt
cat gb1.txt gb2.txt > gb3.txt

echo Testing...
ruby -v
echo
for i in kb.txt mb.txt gb1.txt gb2.txt gb3.txt
do
  echo
  echo "Running: time ruby readlines.rb $i"
  time ruby readlines.rb $i
  echo '---------------------------------------'
  echo "Running: time ruby foreach.rb $i"
  time ruby foreach.rb $i
  echo
done

rm [km]b.txt gb[123].txt
```

It creates five files of increasing sizes. 1K files are easily processed, and are very common. It used to be that 1MB files were considered big, but they're common now. 1GB is common in my environment, and files beyond 10GB are encountered periodically, so knowing what happens at 1GB and beyond is very important.

Save this as "readlines.rb". It doesn't do anything but read the entire file internally, split it into lines appended to an array that is then returned, and it seems like it'd be fast since it's all written in C:

```ruby
lines = File.readlines(ARGV.shift).size
puts "#{ lines } lines read"
```

Save this as "foreach.rb":

```ruby
lines = 0
File.foreach(ARGV.shift) { |l| lines += 1 }
puts "#{ lines } lines read"
```

Running sh ./test.sh on my laptop I get:

```
Building test files...
Testing...
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]
```

Reading the 1K file:

```
Running: time ruby readlines.rb kb.txt
28 lines read

real    0m0.998s
user    0m0.386s
sys     0m0.594s
---------------------------------------
Running: time ruby foreach.rb kb.txt
28 lines read

real    0m1.019s
user    0m0.395s
sys     0m0.616s
```

Reading the 1MB file:

```
Running: time ruby readlines.rb mb.txt
27028 lines read

real    0m1.021s
user    0m0.398s
sys     0m0.611s
---------------------------------------
Running: time ruby foreach.rb mb.txt
27028 lines read

real    0m0.990s
user    0m0.391s
sys     0m0.591s
```

Reading the 1GB file:

```
Running: time ruby readlines.rb gb1.txt
27027028 lines read

real    0m19.407s
user    0m17.134s
sys     0m2.262s
---------------------------------------
Running: time ruby foreach.rb gb1.txt
27027028 lines read

real    0m10.378s
user    0m9.472s
sys     0m0.898s
```

Reading the 2GB file:

```
Running: time ruby readlines.rb gb2.txt
54054055 lines read

real    0m58.904s
user    0m54.718s
sys     0m4.029s
---------------------------------------
Running: time ruby foreach.rb gb2.txt
54054055 lines read

real    0m19.992s
user    0m18.765s
sys     0m1.194s
```

Reading the 3GB file:

```
Running: time ruby readlines.rb gb3.txt
81081082 lines read

real    2m7.260s
user    1m57.410s
sys     0m7.007s
---------------------------------------
Running: time ruby foreach.rb gb3.txt
81081082 lines read

real    0m33.116s
user    0m30.790s
sys     0m2.134s
```

Notice how readlines slows super-linearly as the file size grows, while foreach slows linearly. At 1MB, we can already see there's something affecting the "slurping" I/O that doesn't affect reading line-by-line. And, because 1MB files are very common these days, it's easy to see they'll slow the processing of files over the lifetime of a program if we don't think ahead. A couple of seconds here or there aren't much when they happen once, but if they happen multiple times a minute it adds up to a serious performance impact by the end of a year.

I ran into this problem years ago when processing large data files. The Perl code I was using would periodically stop as it reallocated memory while loading the file. Rewriting the code to read and process the data file line-by-line instead of slurping it cut the runtime from over five minutes to less than one, and taught me a big lesson.

"Slurping" a file is sometimes useful, especially if you have to do something across line boundaries; however, it's worth spending some time thinking about alternate ways of reading the file if you do. For instance, consider maintaining a small buffer built from the last "n" lines and scanning that. It avoids the memory-management issues caused by trying to read and hold the entire file. This is discussed in the Perl-related blog post "Perl Slurp-Eaze", which covers the "whens" and "whys" justifying full file-reads, and it applies well to Ruby.
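As a sketch of that rolling-buffer idea (the window size and the two-line pattern here are arbitrary examples), keep only the last "n" lines and check the window each time it slides:

```ruby
require "tempfile"

# Sample input so the sketch runs stand-alone; the content is made up.
file = Tempfile.new("log")
file.write("alpha\nbeta\nERROR start\ndetails\ngamma\n")
file.close

WINDOW = 3  # arbitrary: how many recent lines to keep in memory
buffer = []
matches = 0

File.foreach(file.path) do |line|
  buffer << line.chomp
  buffer.shift if buffer.size > WINDOW  # drop the oldest line

  # Match a pattern that spans a line boundary, but only within the
  # small window -- the whole file never sits in memory at once.
  matches += 1 if buffer.last(2) == ["ERROR start", "details"]
end

puts matches  # => 1
```

Memory use is bounded by the window size rather than the file size, so this works the same on a 1KB file and a 10GB one.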

For other excellent reasons not to "slurp" your files, read "How to search file text for a pattern and replace it with a given value".


This is kind of old, but I'm a little surprised nobody mentions that slurping an input file makes the program practically useless for pipelines. In a pipeline, the input file might be small but slow. If your program slurps, it isn't working with the data as it becomes available; instead, it makes you wait for however long the input takes to complete. How long? It could be anything: hours or days if I'm running a grep or find over a big hierarchy. The input could also be designed never to complete, like an infinite file. For example, journalctl -f will continue to output whatever events happen in the system without stopping; tshark will output whatever it sees on the network without stopping; ping will continue pinging without stopping. /dev/zero is infinite, and /dev/urandom is infinite.
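The pipeline-friendly version in Ruby is to iterate the input one line at a time, so the program emits results as each line arrives instead of waiting for EOF. A sketch, using StringIO as a self-contained stand-in for the stream (in a real script you'd iterate ARGF or $stdin the same way; the grep-like pattern is just an example):

```ruby
require "stringio"

# StringIO stands in for $stdin/ARGF so the sketch runs on its own.
input = StringIO.new("ok\nERROR: disk full\nok\nerror again\n")

pattern = /error/i
matched = []

# each_line pulls one line at a time -- nothing waits for EOF, so the
# same loop works against endless producers like `journalctl -f`.
input.each_line do |line|
  matched << line.chomp if line.match?(pattern)
end

puts matched.size  # => 2
```

With ARGF in place of the StringIO, the script behaves like a classic Unix filter: `journalctl -f | ruby filter.rb` produces output as events happen, not after the (never-arriving) end of input.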

The only time I could see slurping as acceptable would maybe be for configuration files, since the program probably can't do anything until it has finished reading the whole thing anyway.
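For that configuration-file case, slurping really is the right tool: the file is tiny and the program needs all of it before proceeding. A sketch (the JSON keys here are made-up examples):

```ruby
require "tempfile"
require "json"

# A small JSON config file -- the keys are hypothetical.
file = Tempfile.new(["config", ".json"])
file.write('{"host": "localhost", "port": 8080}')
file.close

# Reading a few hundred bytes in one call is fine; none of the
# large-file concerns above apply.
config = JSON.parse(File.read(file.path))

puts config["port"]  # => 8080
```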


Why is "slurping" a file not a good practice for normal text-file I/O

The Tin Man hits it right. I'd also like to add:

  • In many cases, reading the entire file into memory is not tractable (because either the file is too big, or the string manipulations have exponential O() space requirements)

  • Oftentimes, you cannot anticipate the file size (a special case of the above)

  • You should always try to be cognizant of memory usage, and reading all of the file at once (even in trivial situations) is not good practice if an alternative option exists (e.g., line-by-line). I know from experience that VBS is horrible in this sense and one is forced into manipulating files through the command line.

This concept applies not just to files, but to any other process where memory use grows quickly and you need to handle each iteration (or line) as it comes. Generator functions help by handing you the results one at a time, so you never hold all the data in memory.
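Ruby's counterpart to generator functions is the Enumerator (and Enumerable#lazy), which produces values on demand instead of materializing them all. A small sketch with a deliberately infinite source:

```ruby
# An Enumerator produces values on demand, like a Python generator.
counter = Enumerator.new do |yielder|
  n = 0
  loop do
    yielder << n  # execution pauses here until the next value is requested
    n += 1
  end
end

# The source is infinite, but lazy evaluation only computes what we take.
firsts = counter.lazy.select(&:even?).first(3)
puts firsts.inspect  # => [0, 2, 4]
```

The same pattern underlies File.foreach: the consumer pulls one item at a time, so memory use is independent of how much data the producer could emit.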

As an aside/extra, Python is very smart about reading files: the file object returned by its open() is designed to be iterated line-by-line by default. See "Improve Your Python: 'yield' and Generators Explained", which explains a good use case for generator functions.