List files that contain `n` or fewer lines

With GNU awk for nextfile and ENDFILE:

awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt
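For readability, the same gawk logic can be spread over lines with comments added (this is just a restatement of the one-liner above, not a different script):

awk -v n=27 '
    FNR > n { f=1; nextfile }                  # too many lines: set the flag and skip the rest of this file
    ENDFILE { if (!f) print FILENAME; f=0 }    # runs at the end of every file, even an empty one
' *.txt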

With any awk:

awk -v n=27 '
    { fnrs[FILENAME] = FNR }
    END {
        for (i=1; i<ARGC; i++) {
            filename = ARGV[i]
            if ( fnrs[filename] < n ) {
                print filename
            }
        }
    }
' *.txt

Those will both work whether the input files are empty or not. The caveats for the non-gawk version are the same as for your other current awk answers:

  1. It relies on the same file name not appearing multiple times (e.g. awk 'script' foo bar foo) when you want it displayed multiple times, and
  2. It relies on there being no variables set in the arg list (e.g. awk 'script' foo FS=, bar)

The gawk version has no such restrictions.
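As an illustration of caveat 2, consider this hypothetical invocation (file names made up): the assignment FS=, lands in ARGV, its fnrs entry is never set (so it compares as 0, which is less than n), and the END loop of the portable version would print it as if it were a small file:

# hypothetical example of caveat 2: "FS=," is in ARGV, fnrs["FS=,"] is unset,
# so 0 < n holds and "FS=," gets printed alongside the real file names
awk -v n=27 '{ fnrs[FILENAME]=FNR } END{ for (i=1;i<ARGC;i++) if (fnrs[ARGV[i]] < n) print ARGV[i] }' file1.txt FS=, file2.txt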

UPDATE:

To test the timing between the above GNU awk script and the GNU grep+sed script posted by xhienne, since she stated that her solution would be faster than a pure awk script, I created 10,000 input files, each 0 to 1000 lines long, using this script:

$ awk -v numFiles=10000 -v maxLines=1000 'BEGIN{for (i=1;i<=numFiles;i++) {numLines=int(rand()*(maxLines+1)); out="out_"i".txt"; printf "" > out; for (j=1;j<=numLines; j++) print ("foo" j) > out} }'

and then ran the 2 commands on them and got these 3rd run timing results:

$ time grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//' > out.grepsed

real    0m1.326s
user    0m0.249s
sys     0m0.654s

$ time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt > out.awk

real    0m1.092s
user    0m0.343s
sys     0m0.748s

Both scripts produced the same output files. The above was run in bash on cygwin. I expect the timing results may vary a little on other systems, but the difference will always be negligible.


To print 10 lines of up to 20 random chars per line (see the comments):

$ maxChars=20
$ LC_ALL=C tr -dc '[:print:]' </dev/urandom |
    fold -w "$maxChars" |
    awk -v maxChars="$maxChars" -v numLines=10 '
        { print substr($0,1,rand()*(maxChars+1)) }
        NR==numLines { exit }
    '
0J)-8MzO2V\XA/o'qJH@r5|g<WOP780^O@bM\vP{l^pgKUFH9-6r&]/-6dl}pp W&.UnTYLoi['2CEtBY~wrM3>4{^F1mc9?~NHh}a-EEV=O1!yof

To do it all within awk (which will be much slower):

$ cat tst.awk
BEGIN {
    for (i=32; i<127; i++) {
        chars[++charsSize] = sprintf("%c",i)
    }
    minChars = 1
    maxChars = 20
    srand()
    for (lineNr=1; lineNr<=10; lineNr++) {
        numChars = int(minChars + rand() * (maxChars - minChars + 1))
        str = ""
        for (charNr=1; charNr<=numChars; charNr++) {
            charsIdx = int(1 + rand() * charsSize)
            str = str chars[charsIdx]
        }
        print str
    }
}

$ awk -f tst.awk
Heer H{QQ?qHDv|PsuqEy`-:O2v7[]|N^EJ0j#@/y>CJ3:=3*b-joG:?^|O.[tYlmDoTjLw`2Rs=!('IChui


If you are using GNU grep (unfortunately MacOSX >= 10.8 provides BSD grep whose -m and -c options act globally, not per file), you may find this alternative interesting (and faster than a pure awk script):

grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//'

Explanation:

  • grep -c -m28 -H ^ *.txt outputs the name of each file along with its line count, but never reads more than 28 lines of any file
  • sed '/:28$/ d; s/:[^:]*$//' removes the files that have at least 28 lines and prints the names of the others (see the sample run below)
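For instance, with three hypothetical files a.txt (3 lines), b.txt (50 lines) and c.txt (empty), the intermediate grep output and the final result would look like this:

$ grep -c -m28 -H ^ a.txt b.txt c.txt
a.txt:3
b.txt:28
c.txt:0
$ grep -c -m28 -H ^ a.txt b.txt c.txt | sed '/:28$/ d; s/:[^:]*$//'
a.txt
c.txt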

Alternate version: sequential processing instead of a parallel one

res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"

Benchmarking

Ed Morton challenged my claim that this answer may be faster than awk. He added some benchmarks to his answer and, although he does not draw any conclusion, I consider the results he posted misleading: they show a greater wall-clock time for my answer without any regard for user and sys times. Therefore, here are my results.

First the test platform:

  • A four-core Intel i5 laptop running Linux, probably quite close to OP's system (Apple iMac).

  • A brand new directory of 100,000 text files with ~400 lines on average, for a total of 640 MB, which is kept entirely in my system buffers. The files were created with this command:

    for ((f = 0; f < 100000; f++)); do echo "File $f..."; for ((l = 0; l < RANDOM & 1023; l++)); do echo "File $f; line $l"; done > file_$f.txt; done

Results:

Conclusion:

At the time of writing, on a regular Unix multi-core laptop similar to OP's machine, this answer is the fastest that gives accurate results. On my machine, it is twice as fast as the fastest awk script.

Notes:

  • Why does the platform matter? Because my answer relies on parallelizing the processing between grep and sed. Of course, for unbiased results, if you have only one CPU core (a VM?) or other OS-imposed limitations on CPU allocation, you should benchmark the alternate (sequential) version.

  • Obviously, you can't draw conclusions from the wall time alone, since it depends on the number of concurrent processes asking for the CPU versus the number of cores on the machine. That is why I have added the user+sys timings.

  • Those timings are an average over 20 runs, except when the command took more than 1 minute (one run only).

  • For all the answers that take less than 10 s, the time spent by the shell to process *.txt is not negligible, therefore I preprocessed the file list, put it in a variable, and appended the content of the variable to the command I was benchmarking (a sketch of this is given after these notes).

  • All answers gave the same results except: 1. tripleee's answer, which includes argv[0] ("awk") in its result (fixed in my tests); 2. kvantour's answer, which only listed empty files (fixed with -v n=27); and 3. the find+sed answer, which misses empty files (not fixed).

  • I couldn't test ctac_'s answer since I have no GNU sed 4.5 at hand. It is probably the fastest of all but also misses empty files.

  • The python answer doesn't close its files. I had to do ulimit -n hard first.
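A minimal sketch of that file-list preprocessing, assuming the benchmark file names contain no whitespace (the variable name files matches the alternate sequential command above; the output redirection is arbitrary):

files=$(echo *.txt)     # expand the glob once, before the timed command
time grep -c -m28 -H ^ $files | sed '/:28$/ d; s/:[^:]*$//' > /dev/null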


You may try this awk, which moves on to the next file as soon as the line count goes above 27:

awk -v n=27 'BEGIN{for (i=1; i<ARGC; i++) f[ARGV[i]]}
FNR > n{delete f[FILENAME]; nextfile}
END{for (i in f) print i}' *.txt

awk processes files line by line, so it won't attempt to read a complete file just to get its line count.
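A quick hypothetical check (file names and contents invented for illustration; thanks to nextfile, only the first 28 lines of long.txt are ever read):

printf 'a\nb\n' > short.txt     # 2 lines
seq 100 > long.txt              # 100 lines
awk -v n=27 'BEGIN{for (i=1; i<ARGC; i++) f[ARGV[i]]}
FNR > n{delete f[FILENAME]; nextfile}
END{for (i in f) print i}' short.txt long.txt
# expected output: short.txt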