Performance issue with parsing large log files (~5gb) using awk, grep, sed


You need to perform some tests to find out where your bottlenecks are, and how fast your various tools perform. Try some tests like this:

time fgrep '2064351200' example.log >/dev/null
time egrep '2064351200' example.log >/dev/null
time sed -e '/2064351200/!d' example.log >/dev/null
time awk '/2064351200/ {print}' example.log >/dev/null

Traditionally, egrep should be the fastest of the bunch (yes, faster than fgrep), but some modern implementations are adaptive and automatically switch to the most appropriate searching algorithm. If you have bmgrep (which uses the Boyer-Moore search algorithm), try that. Generally, sed and awk will be slower because they're designed as more general-purpose text manipulation tools rather than being tuned for the specific job of searching. But it really depends on the implementation, and the correct way to find out is to run tests. Run them each several times so you don't get messed up by things like caching and competing processes.
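To make the "run them several times" advice concrete, here is a minimal sketch of a repeat-and-time loop. The generated sample log, the `bench` helper, and the `runs` count are all made up for illustration; point `log` at your real file instead. It uses `grep -F` and `grep -E`, the current spellings of fgrep and egrep, and whole-second timing from `date +%s`, which should be precise enough at the 5 GB scale.

```shell
#!/bin/sh
# Build a small stand-in log (hypothetical format) so the sketch runs anywhere.
log=$(mktemp)
awk 'BEGIN { for (i = 0; i < 50000; i++) print "request", 2064351000 + i % 500, "line", i }' > "$log"

pattern='2064351200'
runs=3

# bench LABEL COMMAND...: run COMMAND on $log $runs times, report elapsed seconds.
bench() {
    label=$1
    shift
    start=$(date +%s)
    n=0
    while [ "$n" -lt "$runs" ]; do
        "$@" "$log" >/dev/null
        n=$((n + 1))
    done
    end=$(date +%s)
    echo "$label: $((end - start))s total over $runs runs"
}

bench 'grep -F' grep -F "$pattern"
bench 'grep -E' grep -E "$pattern"
bench 'sed'     sed -n -e "/$pattern/p"
bench 'awk'     awk "/$pattern/"
```

On a tiny sample every line will report 0s; the numbers only become meaningful on the real log.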

As @Ron pointed out, your search process may be disk I/O bound. If you will be searching the same log file a number of times, it may be faster to compress the log file first; this makes it faster to read off disk, but then requires more CPU time to process because it has to be decompressed first. Try something like this:

compress -c example2.log >example2.log.Z
time zgrep '2064351200' example2.log.Z >/dev/null
gzip -c example2.log >example2.log.gz
time zgrep '2064351200' example2.log.gz >/dev/null
bzip2 -k example.log
time bzgrep '2064351200' example.log.bz2 >/dev/null

I just ran a quick test with a fairly compressible text file, and found that bzip2 compressed best, but then took far more CPU time to decompress, so the gzip option wound up being fastest overall. Your computer will have different disk and CPU performance than mine, so your results may be different. If you have any other compressors lying around, try them as well, and/or try different levels of gzip compression, etc.
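Trying different gzip levels could look roughly like the sketch below. The sample log and file names are invented for illustration; levels 1, 6 and 9 are just three points on the scale. It reports the compressed size and match count per level; put `time` in front of the search line when running it against a real log.

```shell
#!/bin/sh
# Hypothetical sample log in a scratch directory.
work=$(mktemp -d)
log="$work/example.log"
awk 'BEGIN { for (i = 0; i < 50000; i++) print "request", 2064351000 + i % 500, "line", i }' > "$log"

for level in 1 6 9; do
    gz="$work/example.$level.gz"
    # Compress at this level, keeping the original file.
    gzip "-$level" -c "$log" > "$gz"
    size=$(wc -c < "$gz")
    # Search the compressed copy (this is the step worth timing).
    matches=$(gzip -cd "$gz" | grep -F -c '2064351200')
    echo "gzip -$level: $size bytes, $matches matching lines"
done
```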

Speaking of preprocessing: if you're searching the same log over and over, is there a way to preselect out just the log lines that you might be interested in? If so, grep them out into a smaller (maybe compressed) file, then search that instead of the whole thing. As with compression, you spend some extra time up front, but then each individual search runs faster.
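A minimal sketch of that preselection idea, assuming (purely for illustration) that the requests you care about all share the prefix 20643512. The sample log and the `subset.log.gz` name are made up.

```shell
#!/bin/sh
# Hypothetical sample log in a scratch directory.
work=$(mktemp -d)
log="$work/example.log"
awk 'BEGIN { for (i = 0; i < 50000; i++) print "request", 2064351000 + i % 500, "line", i }' > "$log"

# One-time cost: keep only the lines that could ever be interesting, compressed.
subset="$work/subset.log.gz"
grep -E '20643512[0-9][0-9]' "$log" | gzip -c > "$subset"

# Every later search reads the small subset instead of the whole log.
full=$(grep -F -c '2064351200' "$log")
fast=$(gzip -cd "$subset" | grep -F -c '2064351200')
echo "full log: $full matches, subset: $fast matches"
```

Both counts should agree as long as the preselection pattern is a superset of everything you later search for.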

A note about piping: other things being equal, piping a huge file through multiple commands will be slower than having a single command do all the work. But all things are not equal here, and if using multiple commands in a pipe (which is what zgrep and bzgrep do) buys you better overall performance, go for it. Also, consider whether you're actually passing all of the data through the entire pipe. In the example you gave, fgrep '2064351200' example.log | fgrep 'action: example', the first fgrep will discard most of the file; the pipe and second command only have to process the small fraction of the log that contains '2064351200', so the slowdown will likely be negligible.
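To check the piping point on your own data, you can compare the two-stage pipe with a single awk that tests both conditions, first confirming that they select the same lines. This is only a sketch: the sample log and output file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample log: every third line carries "action: example".
work=$(mktemp -d)
log="$work/example.log"
awk 'BEGIN { for (i = 0; i < 50000; i++)
    print "request", 2064351000 + i % 500, (i % 3 ? "action: other" : "action: example") }' > "$log"

# Two-stage pipe: the first grep discards most lines, so the second sees very little.
grep -F '2064351200' "$log" | grep -F 'action: example' > "$work/pipe.out"

# Single process testing both conditions on every line.
awk '/2064351200/ && /action: example/' "$log" > "$work/single.out"

# Confirm both approaches select the same lines before comparing their speed.
if cmp -s "$work/pipe.out" "$work/single.out"; then
    echo "identical output: $(wc -l < "$work/pipe.out") lines"
fi
```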

tl;dr TEST ALL THE THINGS!

EDIT: if the log file is "live" (i.e. new entries are being added), but the bulk of it is static, you may be able to use a partial preprocess approach: compress (& maybe prescan) the log, then when scanning use the compressed (&/prescanned) version plus a tail of the part of the log added since you did the prescan. Something like this:

# Precompress:
gzip -v -c example.log >example.log.gz
# (column 2 of `gzip -l` is the uncompressed size, i.e. how much of example.log is already in the .gz)
compressedsize=$(gzip -l example.log.gz | awk '{if(NR==2) print $2}')
# Search the compressed file + recent additions:
{ gzip -cdfq example.log.gz; tail -c +$compressedsize example.log; } | egrep '2064351200'

If you're going to be doing several related searches (e.g. a particular request, then specific actions with that request), you can save prescanned versions:

# Prescan for a particular request (repeat for each request you'll be working with):
gzip -cdfq example.log.gz | egrep '2064351200' > prescan-2064351200.log
# Search the prescanned file + recent additions:
{ cat prescan-2064351200.log; tail -c +$compressedsize example.log | egrep '2064351200'; } | egrep 'action: example'
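One caveat with the offset above: the gzip format stores the input size modulo 2^32, so depending on your gzip version, `gzip -l` can misreport the size of a file of 4 GB or more, which matters for a ~5 GB log. A variation (a sketch, not tested on a real 5 GB file) is to record the byte count with `wc -c` at snapshot time and compress exactly that many bytes. The `offset` variable and sample log are made up, and `head -c` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# Hypothetical sample log in a scratch directory.
work=$(mktemp -d)
log="$work/example.log"
awk 'BEGIN { for (i = 0; i < 50000; i++) print "request", 2064351000 + i % 500, "line", i }' > "$log"

# Snapshot: note how many bytes exist now, then compress exactly that many,
# so the offset stays correct even if the log grows during compression.
offset=$(wc -c < "$log" | tr -d ' ')
head -c "$offset" "$log" | gzip -c > "$log.gz"

# Simulate entries appended after the snapshot.
printf '%s\n' 'request 2064351200 line new-1' 'request 2064359999 line new-2' >> "$log"

# Search = compressed snapshot + everything after the recorded offset.
# tail -c +N starts at byte N (1-based), hence offset + 1.
combined=$({ gzip -cd "$log.gz"; tail -c "+$((offset + 1))" "$log"; } | grep -F -c '2064351200')
direct=$(grep -F -c '2064351200' "$log")
echo "combined=$combined direct=$direct"
```

The two counts should match; if they do not, the snapshot and the tail overlap or leave a gap.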


If you don't know the sequence of your strings, then:

awk '/str1/ && /str2/ && /str3/ && /str4/' filename

If you know that they will appear one following another in the line:

grep 'str1.*str2.*str3.*str4' filename

(note for awk, {print} is the default action block, so it can be omitted if the condition is given)
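A quick way to see the difference between the two commands, using a three-line throwaway file (contents invented for the demonstration):

```shell
#!/bin/sh
f=$(mktemp)
printf '%s\n' \
    'str1 then str2 then str3 then str4' \
    'str4 then str3 then str2 then str1' \
    'str1 and str2 only' > "$f"

# awk: all four strings must be present, in any order.
any_order=$(awk '/str1/ && /str2/ && /str3/ && /str4/' "$f" | wc -l)

# grep: the strings must appear left to right in this exact sequence.
in_order=$(grep -c 'str1.*str2.*str3.*str4' "$f")

echo "any order: $any_order line(s); fixed order: $in_order line(s)"
```

The awk version matches the first two lines, while the grep version matches only the first.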

Dealing with files that large is going to be slow no matter how you slice it.


As to multi-line programs on the command line,

$ awk 'BEGIN { print "File\tOwner" }
> { print $8, "\t", \
> $3}
> END { print " - DONE -" }' infile > outfile

Note the single quotes.
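Why the single quotes matter: inside double quotes the shell expands `$3` itself before awk ever sees the program. A small sketch, with function and variable names made up for the demonstration:

```shell
#!/bin/sh
line='alpha beta gamma'

# Single quotes: awk receives the program text { print $3 } unchanged.
with_single() { echo "$line" | awk '{ print $3 }'; }

# Double quotes: the shell substitutes its own $3 (empty here, since the
# function is called without arguments), so awk receives { print  }.
with_double() { echo "$line" | awk "{ print $3 }"; }

single=$(with_single)
double=$(with_double)
echo "single quotes: $single"
echo "double quotes: $double"
```

With single quotes awk prints the third field (gamma); with double quotes it prints the whole line, because a bare `print` prints `$0`.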