How to speed-up sed that uses Regex on very large single cell BAM file
Another awk, pretty much like @tripleee's, I'd assume:
$ samtools view -h small.bam | awk 'match($0,/CB:Z:[ACGT]*/) { # use match for the regex match a[substr($0,RSTART+5,RLENGTH-5)]++ # len(CB:z:)==5, hence +-5}END { for(i in a) print i,a[i] # sample output,tweak it to your liking}'
Sample output:
...TCTTAATCGTCC 175GGGAAGGCCTAA 190TCGGCCGATCGG 32GACTTCCAAGCC 76CCGCGGCATCGG 36TAGCGATCGTGG 125...
Notice: Your sed 's/.*CB:Z:...
matches the last instance where as my awk 'match($0,/CB:Z:[ACGT]*/)...
matches the first.
Notice 2: Quoting @Sundeep in the comments: - - using LC_ALL=C mawk '..'
will give even better speed.
With perl
perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'
CB:Z:\K[ACGT]++
will match any sequence of ACGT
characters preceded by CB:Z:
. \K
is used here to prevent CB:Z:
from being part of matched portion, which is available via $&
variable
Sample time with small.bam
input file. mawk
is fastest for this input, but it might change for larger input file.
# script.awk is the one mentioned in James Brown's answer# result here shown with GNU awk$ time LC_ALL=C awk -f script.awk small.bam > f1real 0m0.092s# mawk is faster compared to GNU awk for this use case$ time LC_ALL=C mawk -f script.awk small.bam > f2real 0m0.054s$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3real 0m0.064s$ diff -sq <(sort f1) <(sort f2)Files /dev/fd/63 and /dev/fd/62 are identical$ diff -sq <(sort f1) <(sort f3)Files /dev/fd/63 and /dev/fd/62 are identical
Better to avoid parsing the output of samtools view
in the first place. Here's one way to get what you need just using python and the pysam library:
import pysamfrom collections import defaultdictcounts = defaultdict(int)tag = 'CB'with pysam.AlignmentFile('small.bam') as sam: for aln in sam: if aln.has_tag(tag): counts[ aln.get_tag(tag) ] += 1for k, v in counts.items(): print(k, v)