
Print many specific rows from a text file using an index file


The script in your question will be extremely fast since all it does is a hash lookup of the current line number in the array h. This will be faster still, though, since it exits as soon as the last desired line number has been printed instead of reading the rest of reads.fastq (there's no gain, of course, if the last desired line number is the last line of reads.fastq):

awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
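That's the same one-liner, just laid out over multiple lines with comments for readability:

awk '
    FNR==NR  { h[$1]; c++; next }        # takeThese.txt: remember each wanted line number, count them
    FNR in h { print; if (!--c) exit }   # reads.fastq: print wanted lines, quit after the last one
' takeThese.txt reads.fastq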

You could throw in a delete h[FNR]; after the print; to reduce the array size and so maybe speed up the lookup time, but I don't know if that will really improve performance: the array access is a hash lookup and so already extremely fast, and adding a delete may end up slowing the script down overall.
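For reference, that variant (the delete h[FNR]; is the only change, and untested here) would be:

awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; delete h[FNR]; if (!--c) exit}' takeThese.txt reads.fastq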

Actually, this will be faster still since it avoids testing NR==FNR for every line in both files:

awk -v nums='takeThese.txt' '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq

Whether that is faster than the script @glennjackman posted depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's script reads the whole of reads.fastq no matter what the contents of takeThese.txt, it'll execute in roughly constant time, while mine will be significantly faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. e.g.

$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq


$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null

real    0m50.060s
user    0m47.564s
sys     0m0.405s


$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
    BEGIN{ while ((getline i < nums) > 0) {h[i]; c++} }
    NR in h{print; if (!--c) exit}
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.015s
sys     0m0.000s

but you can have the best of both worlds with:

$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.057s
user    0m26.675s
sys     0m0.498s

$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( ( ("sort -n " nums) | getline i) > 0 ) {
            return i
        }
        else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.030s
sys     0m0.062s

which, if we assume takeThese.txt is already sorted, can be reduced to just:

$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m0.047s
user    0m0.030s
sys     0m0.016s


I think the solution in the question stores all the line numbers from takeThese.txt in an array, h[], and then for each line in reads.fastq looks up that line number in h[].
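The question's script isn't quoted in this thread, but from that description (and the array h mentioned in the first answer) it was presumably something along these lines, which is an assumption:

awk 'NR==FNR{h[$1]; next} FNR in h{print}' takeThese.txt reads.fastq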

There are several simple improvements to that in different languages. I would try Perl if you're not comfortable with Java.

Basically you should make sure takeThese.txt is sorted, then go through reads.fastq one line at a time, scanning for a line number that matches the next line number from takeThese.txt, then pop that and continue.

Since rows are of different lengths you have no choice but to scan for the newline character (the basic for-each-line construct in most languages).

Example in perl, quick and dirty but it works:

open(F1, "reads.fastq");
open(F2, "takeThese.txt");
$f1_pos = 1;
foreach $index (<F2>) {
   while ($f1_pos <= $index) {
      $out = <F1>; $f1_pos++;
   }
   print $out;
}


I would try one of these:

  1. may result in false positives (a number from takeThese.txt can also match text within a row, not just the line number cat -n prepends):

    cat -n reads.fastq | grep -Fwf takeThese.txt | cut -d$'\t' -f2-
  2. requires one of {bash,ksh,zsh} for the process substitution (the sed program this generates is illustrated after the list):

    sed -n -f <(sed 's/$/p/' takeThese.txt) reads.fastq
  3. this is similar to Andreas Wederbrand's perl answer, implemented in awk:

    awk -v nums=takeThese.txt '
        function next_index() {
            ("sort -n " nums) | getline i
            return i
        }
        BEGIN { linenum = next_index() }
        NR == linenum { print; linenum = next_index() }
    ' reads.fastq
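To make option 2 concrete: the inner sed appends a p command to every line number, and the outer sed -n then prints exactly those lines. With made-up line numbers in takeThese.txt:

    $ printf '%s\n' 100 200 300 > takeThese.txt
    $ sed 's/$/p/' takeThese.txt
    100p
    200p
    300p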

But, if you're dealing with a lot of data, text processing tools will take time. Your other option is to import the data into a proper database and use SQL to extract it: database engines are built for this kind of stuff.
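For example, a rough sqlite3 sketch of that idea (the file and table names here are made up; assumes the sqlite3 shell, a /dev/stdin-capable system, and no tabs in the data):

# create tables for the rows and for the wanted line numbers
sqlite3 reads.db 'CREATE TABLE reads(n INTEGER PRIMARY KEY, line TEXT);' \
                 'CREATE TABLE takes(n INTEGER PRIMARY KEY);'
# load reads.fastq tagged with its line numbers, then the index file
awk -v OFS='\t' '{print NR, $0}' reads.fastq |
    sqlite3 reads.db '.mode tabs' '.import /dev/stdin reads'
sqlite3 reads.db '.import takeThese.txt takes'
# the extraction itself is then a simple join
sqlite3 reads.db 'SELECT line FROM reads JOIN takes USING(n) ORDER BY n;'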