List files that contain `n` or fewer lines
With GNU awk for `nextfile` and `ENDFILE`:

```shell
awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt
```
With any awk (note `<=` so that a file with exactly `n` lines is printed, matching the GNU awk version above):

```shell
awk -v n=27 '
    { fnrs[FILENAME] = FNR }
    END {
        for (i=1; i<ARGC; i++) {
            filename = ARGV[i]
            if ( fnrs[filename] <= n ) {
                print filename
            }
        }
    }
' *.txt
```
Those will both work whether the input files are empty or not. The caveats for the non-gawk version are the same as for your other current awk answers:
- It relies on the same file name not appearing multiple times (e.g. `awk 'script' foo bar foo`) and you wanting it displayed multiple times, and
- It relies on there being no variables set in the arg list (e.g. `awk 'script' foo FS=, bar`).
The gawk version has no such restrictions.
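The second caveat arises because the `END` loop walks `ARGV`, and awk keeps variable assignments such as `FS=,` as ordinary entries in that array. A minimal sketch (the names `foo` and `bar` are hypothetical and need not exist, since only the `BEGIN` block runs):

```shell
awk 'BEGIN { for (i=1; i<ARGC; i++) print i, ARGV[i] }' foo FS=, bar
# 1 foo
# 2 FS=,
# 3 bar
```

So the `END` loop of the portable script above would try to look up `FS=,` as if it were a file name.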
UPDATE:
To test the timing between the GNU awk script above and the GNU grep+sed script posted by xhienne (since she stated that her solution would be faster than a pure awk script), I created 10,000 input files, each 0 to 1000 lines long, using this script:
```shell
$ awk -v numFiles=10000 -v maxLines=1000 'BEGIN {
    for (i=1; i<=numFiles; i++) {
        numLines = int(rand()*(maxLines+1))
        out = "out_" i ".txt"
        printf "" > out
        for (j=1; j<=numLines; j++) print ("foo" j) > out
    }
}'
```
and then ran the 2 commands on them and got these 3rd run timing results:
```shell
$ time grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//' > out.grepsed

real    0m1.326s
user    0m0.249s
sys     0m0.654s

$ time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt > out.awk

real    0m1.092s
user    0m0.343s
sys     0m0.748s
```
Both scripts produced the same output files. The above was run in bash on cygwin. I expect the timing results might vary a little on different systems, but the difference will always be negligible.
To print 10 lines of up to 20 random chars per line (see the comments):
```shell
$ maxChars=20
$ LC_ALL=C tr -dc '[:print:]' </dev/urandom |
    fold -w "$maxChars" |
    awk -v maxChars="$maxChars" -v numLines=10 '
        { print substr($0,1,rand()*(maxChars+1)) }
        NR==numLines { exit }
    '
0J)-8MzO2V\XA/o'qJH@r5|g<WOP780^O@bM\vP{l^pgKUFH9-6r&]/-6dl}pp W&.UnTYLoi['2CEtBY~wrM3>4{^F1mc9?~NHh}a-EEV=O1!yof
```
To do it all within awk (which will be much slower):
```shell
$ cat tst.awk
BEGIN {
    for (i=32; i<127; i++) {
        chars[++charsSize] = sprintf("%c",i)
    }
    minChars = 1
    maxChars = 20
    srand()
    for (lineNr=1; lineNr<=10; lineNr++) {
        numChars = int(minChars + rand() * (maxChars - minChars + 1))
        str = ""
        for (charNr=1; charNr<=numChars; charNr++) {
            charsIdx = int(1 + rand() * charsSize)
            str = str chars[charsIdx]
        }
        print str
    }
}
$ awk -f tst.awk
Heer H{QQ?qHDv|PsuqEy`-:O2v7[]|N^EJ0j#@/y>CJ3:=3*b-joG:?^|O.[tYlmDoTjLw`2Rs=!('IChui
```
If you are using GNU grep (unfortunately MacOSX >= 10.8 provides BSD grep, whose `-m` and `-c` options act globally, not per file), you may find this alternative interesting (and faster than a pure `awk` script):
```shell
grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//'
```
Explanation:
- `grep -c -m28 -H ^ *.txt` outputs the name of each file along with its line count, but never reads more than 28 lines of any file
- `sed '/:28$/ d; s/:[^:]*$//'` removes the files that have at least 28 lines and prints the filenames of the others
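You can check the `sed` filter in isolation by feeding it hand-written `name:count` lines of the same shape `grep -c -H` emits; because of `-m28`, 28 is the only count a too-long file can report (file names here are hypothetical):

```shell
printf 'big.txt:28\nsmall.txt:5\nempty.txt:0\n' |
    sed '/:28$/ d; s/:[^:]*$//'
# small.txt
# empty.txt
```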
Alternate version: sequential processing instead of a parallel one

```shell
res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"
```
Benchmarking
Ed Morton challenged my claim that this answer may be faster than `awk`. He added some benchmarks to his answer and, although he does not draw any conclusion, I consider the results he posted misleading: they show a greater wall-clock time for my answer without any regard to user and sys times. Therefore, here are my results.
First the test platform:
A four-core Intel i5 laptop running Linux, probably quite close to OP's system (Apple iMac).
A brand new directory of 100,000 text files with ~400 lines on average, for a total of 640 MB, which is kept entirely in my system buffers. The files were created with this command:
```shell
for ((f = 0; f < 100000; f++)); do
    echo "File $f..."
    for ((l = 0; l < RANDOM & 1023; l++)); do
        echo "File $f; line $l"
    done > file_$f.txt
done
```
Results:
- grep+sed (this answer) : 561 ms elapsed, 586 ms user+sys
- grep+sed (this answer, sequential version) : 678 ms elapsed, 688 ms user+sys
- awk (Ed Morton): 1050 ms elapsed, 1036 ms user+sys
- awk (tripleee): 1137 ms elapsed, 1123 ms user+sys
- awk (anubhava): 1150 ms elapsed, 1137 ms user+sys
- awk (kvantour): 1280 ms elapsed, 1266 ms user+sys
- python (Joey Harrington): 1543 ms elapsed, 1537 ms user+sys
- find+xargs+sed (agc): 91 s elapsed, 10 s user+sys
- for+awk (Jeff Schaller): 247 s elapsed, 83 s user+sys
- find+bash+grep (hek2mgl): 356 s elapsed, 116 s user+sys
Conclusion:
At the time of writing, on a regular Unix multi-core laptop similar to OP's machine, this answer is the fastest that gives accurate results. On my machine, it is twice as fast as the fastest awk script.
Notes:
- Why does the platform matter? Because my answer relies on parallelizing the processing between `grep` and `sed`. Of course, for unbiased results, if you have only one CPU core (VM?) or other limitations by your OS regarding CPU allocation, you should benchmark the alternate (sequential) version.
- Obviously, you can't conclude on the wall time alone, since it depends on the number of concurrent processes asking for the CPU vs the number of cores on the machine. Therefore I have added the user+sys timings.
- Those timings are an average over 20 runs, except when the command took more than 1 minute (one run only).
- For all the answers that take less than 10 s, the time spent by the shell to process `*.txt` is not negligible, therefore I preprocessed the file list, put it in a variable, and appended the content of the variable to the command I was benchmarking.
- All answers gave the same results except: 1. tripleee's answer, which includes `argv[0]` ("awk") in its result (fixed in my tests); 2. kvantour's answer, which only listed empty files (fixed with `-v n=27`); and 3. the find+sed answer, which misses empty files (not fixed).
- I couldn't test ctac_'s answer since I have no GNU sed 4.5 at hand. It is probably the fastest of all but also misses empty files.
- The python answer doesn't close its files, so I had to do `ulimit -n hard` first.
You may try this `awk`, which moves to the next file as soon as the line count goes above 27:

```shell
awk -v n=27 '
BEGIN   { for (i=1; i<ARGC; i++) f[ARGV[i]] }
FNR > n { delete f[FILENAME]; nextfile }
END     { for (i in f) print i }
' *.txt
```
`awk` processes files line by line, so it won't read a file to completion just to get its line count.
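A hypothetical end-to-end check: files with 27 or fewer lines, including empty ones (which are never read at all), stay in `f` and are printed; `for (i in f)` returns keys in unspecified order, hence the `sort`:

```shell
seq 30 > big.txt    # 30 lines: removed from f, rest of the file skipped
seq 3  > tiny.txt   # 3 lines: kept
: > none.txt        # 0 lines: never read, still in f

awk -v n=27 '
BEGIN   { for (i=1; i<ARGC; i++) f[ARGV[i]] }
FNR > n { delete f[FILENAME]; nextfile }
END     { for (i in f) print i }
' big.txt tiny.txt none.txt | sort
# none.txt
# tiny.txt
```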