Print many specific rows from a text file using an index file
The script in your question will be extremely fast, since all it does is a hash lookup of the current line number in the array h.
This will be faster still, though (unless the last line number you want is the last line of reads.fastq), since it exits after the last desired line number is printed instead of continuing to read the rest of reads.fastq:
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; if (!--c) exit}' takeThese.txt reads.fastq
You could throw in a delete h[FNR]; after the print; to reduce the array size and so MAYBE speed up the lookup time, but I don't know if that will really improve performance, since the array access is a hash lookup and therefore already extremely fast, so adding a delete may end up slowing the script down overall.
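For what it's worth, here is a sketch of where that delete would go, run against tiny made-up input files standing in for the real reads.fastq and takeThese.txt:

```shell
# Toy stand-ins for the real input files.
printf 'read%d\n' 1 2 3 4 5 > reads.fastq
printf '%s\n' 2 4 > takeThese.txt

# Same script as above, with delete h[FNR] added after the print;
# whether freeing entries actually helps or hurts is an open question.
awk 'FNR==NR{h[$1]; c++; next} FNR in h{print; delete h[FNR]; if (!--c) exit}' takeThese.txt reads.fastq
```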
Actually, this will be faster still, since it avoids testing FNR==NR for every line in both files:
awk -v nums='takeThese.txt' '
    BEGIN {
        while ((getline i < nums) > 0) { h[i]; c++ }
    }
    NR in h { print; if (!--c) exit }
' reads.fastq
Whether that is faster, or the script @glennjackman posted is faster, depends on how many lines are in takeThese.txt and how close to the end of reads.fastq they occur. Since Glenn's reads the whole of reads.fastq no matter what the contents of takeThese.txt, it'll execute in roughly constant time, while mine will be significantly faster the further from the end of reads.fastq the last line number in takeThese.txt occurs. For example:
$ awk 'BEGIN {for(i=1;i<=100000000;i++) print i}' > reads.fastq
$ awk 'BEGIN {for(i=1;i<=1000000;i++) print i*100}' > takeThese.txt

$ time awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.720s
user    0m27.876s
sys     0m0.450s

$ time awk -v nums=takeThese.txt '
    BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
    NR in h { print; if (!--c) exit }
' reads.fastq > /dev/null

real    0m50.060s
user    0m47.564s
sys     0m0.405s

$ awk 'BEGIN {for(i=1;i<=100;i++) print i*100}' > takeThat.txt

$ time awk -v nums=takeThat.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m26.738s
user    0m23.556s
sys     0m0.310s

$ time awk -v nums=takeThat.txt '
    BEGIN { while ((getline i < nums) > 0) { h[i]; c++ } }
    NR in h { print; if (!--c) exit }
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.015s
sys     0m0.000s
but you can have the best of both worlds with:
$ time awk -v nums=takeThese.txt '
    function next_index() {
        if ( (("sort -n " nums) | getline i) > 0 ) {
            return i
        } else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m28.057s
user    0m26.675s
sys     0m0.498s

$ time awk -v nums=takeThat.txt '
    function next_index() {
        if ( (("sort -n " nums) | getline i) > 0 ) {
            return i
        } else {
            exit
        }
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq > /dev/null

real    0m0.094s
user    0m0.030s
sys     0m0.062s
which if we assume takeThese.txt is already sorted can be reduced to just:
$ time awk -v nums=takeThese.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m27.362s
user    0m25.599s
sys     0m0.280s

$ time awk -v nums=takeThat.txt '
    BEGIN { getline linenum < nums }
    NR == linenum { print; if ((getline linenum < nums) < 1) exit }
' reads.fastq > /dev/null

real    0m0.047s
user    0m0.030s
sys     0m0.016s
I think the solution in the question stores all line numbers from takeThese.txt in the array h[] and then, for each line in reads.fastq, looks up that line number in h[].
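For reference, the script being discussed is presumably along these lines (the same as the first answer's version, minus the early exit), shown here against toy input files:

```shell
# Toy stand-ins for the real input files.
printf 'read%d\n' 1 2 3 4 5 > reads.fastq
printf '%s\n' 2 4 > takeThese.txt

# Load the wanted line numbers into h, then print the matching lines;
# "FNR in h" tests membership of the current line number in h.
awk 'FNR==NR{h[$1]; next} FNR in h' takeThese.txt reads.fastq
```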
There are several simple improvements on that in different languages. I would try Perl if you're not comfortable with Java.
Basically, you should make sure takeThese.txt is sorted, then just go through reads.fastq one line at a time, scanning for a line number that matches the next line number from takeThese.txt, then pop that and continue.
Since rows are of different lengths, you have no choice but to scan for the newline character (the basic foreach-line construct in most languages).
Example in Perl, quick and dirty but works:

open(F1, "reads.fastq");
open(F2, "takeThese.txt");
$f1_pos = 1;
foreach $index (<F2>) {
    while ($f1_pos <= $index) {
        $out = <F1>;
        $f1_pos++;
    }
    print $out;
}
I would try one of these:

This may result in false positives:

cat -n reads.fastq | grep -Fwf takeThese.txt | cut -d$'\t' -f2-
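A toy illustration (with made-up data) of how those false positives happen: any read whose text happens to contain one of the wanted numbers as a word also matches, because grep searches the whole numbered line, not just the line-number column:

```shell
printf '%s\n' 2 > takeThese.txt
printf '%s\n' 'foo' 'bar' 'has a 2 in it' > reads.fastq

# Tab is cut's default delimiter, so plain -f2- works here.
# Line 3 is a false positive: its text contains the word "2".
cat -n reads.fastq | grep -Fwf takeThese.txt | cut -f2-
```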
This requires one of {bash, ksh, zsh} for the process substitution:

sed -n -f <(sed 's/$/p/' takeThese.txt) reads.fastq
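To make the mechanics concrete, with made-up line numbers: the inner sed turns each number N into the sed command "Np", and the outer sed -n runs that generated program. The sketch below feeds the generated program through a pipe and /dev/stdin instead of <(...), to stay POSIX-sh friendly:

```shell
printf '%s\n' 2 4 > takeThese.txt
printf 'read%d\n' 1 2 3 4 5 > reads.fastq

# The generated sed program: one "print this line" command per number.
sed 's/$/p/' takeThese.txt

# Run it against the data file.
sed 's/$/p/' takeThese.txt | sed -n -f /dev/stdin reads.fastq
```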
This is similar to Andreas Wederbrand's Perl answer, implemented in awk:

awk -v nums=takeThese.txt '
    function next_index() {
        ("sort -n " nums) | getline i
        return i
    }
    BEGIN { linenum = next_index() }
    NR == linenum { print; linenum = next_index() }
' reads.fastq
But, if you're dealing with a lot of data, text processing tools will take time. Your other option is to import the data into a proper database and use SQL to extract it: database engines are built for this kind of stuff.
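As a hedged sketch of the database route, using SQLite (the table and column names are invented for illustration, the files are toy stand-ins, and this assumes the sqlite3 command-line shell is installed):

```shell
# Toy stand-ins for the real files.
printf 'read%d\n' 1 2 3 4 5 > reads.fastq
printf '%s\n' 2 4 > takeThese.txt

# Number each read, load both files into SQLite, and join on line number.
awk '{print NR "\t" $0}' reads.fastq > reads.tsv
rm -f reads.db
sqlite3 reads.db <<'SQL'
CREATE TABLE reads(linenum INTEGER PRIMARY KEY, line TEXT);
CREATE TABLE wanted(linenum INTEGER PRIMARY KEY);
.mode tabs
.import reads.tsv reads
.import takeThese.txt wanted
SELECT line FROM reads JOIN wanted USING (linenum) ORDER BY linenum;
SQL
```

With an index on the line number (here the PRIMARY KEY), repeated extractions avoid rescanning the whole file each time.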