Faster Alternative to Unix Grep


Run each of these under the time command:

$> time grep ">" file.fasta > output.txt
$> time egrep ">" file.fasta > output.txt
$> time awk '/^>/{print $0}' file.fasta > output.txt

The awk pattern only matches when ">" is the first character of the line.

If you compare the timings, they are almost the same.

In my opinion, if the data is in columnar format, use awk to search.


Hand-built state machine. If you only want '>' to be accepted at the beginning of the line, you'll need one more state. If you need to recognise '\r' too, you will need a few more states.

#include <stdio.h>

int main(void)
{
    int state, ch;

    for (state = 0; (ch = getc(stdin)) != EOF; ) {
        switch (state) {
        case 0: /* start */
            if (ch == '>') state = 1;
            else break;
            /* fall through */
        case 1: /* echo */
            fputc(ch, stdout);
            if (ch == '\n') state = 0;
            break;
        }
    }
    if (state == 1) fputc('\n', stdout);
    return 0;
}

If you want real speed, you could replace fgetc() and fputc() with their macro equivalents getc() and putc() (the code above already reads with getc()). But I think trivial programs like this will be I/O-bound anyway.
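The "one more state" mentioned above could look roughly like the sketch below. It is not the original poster's code: the function name copy_header_lines and the state names are made up for illustration. It assumes lines end in '\n', so a '\r' before the newline is simply echoed along with the rest of the line, and files that use a bare '\r' as line terminator are not handled.

```c
#include <stdio.h>

/* Hypothetical state names, not from the original post. */
enum state { AT_LINE_START, ECHO_LINE, SKIP_LINE };

/* Hypothetical helper: copy to `out` only those lines of `in`
   whose first character is '>'. */
void copy_header_lines(FILE *in, FILE *out)
{
    enum state state = AT_LINE_START;
    int ch;

    while ((ch = getc(in)) != EOF) {
        /* The first character of a line decides whether the line is kept. */
        if (state == AT_LINE_START)
            state = (ch == '>') ? ECHO_LINE : SKIP_LINE;
        if (state == ECHO_LINE)
            putc(ch, out);
        /* A newline always returns to the start-of-line state. */
        if (ch == '\n')
            state = AT_LINE_START;
    }
    /* Terminate a final header line that had no trailing newline. */
    if (state == ECHO_LINE)
        putc('\n', out);
}
```

Calling copy_header_lines(stdin, stdout) from main() gives a filter that is used the same way as the program above, except that a '>' in the middle of a line no longer starts echoing.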


For big files, the fastest possible grep can be accomplished with GNU parallel. An example using parallel and grep can be found here.

For your purposes, you may like to try:

cat file.fasta | parallel -j 4 --pipe --block 10M grep "^\>" > output.txt

The above will use four cores and pass 10 MB blocks to grep. The block size is optional, but I find a 10 MB block size quite a bit faster on my system. YMMV.

HTH