How to speed-up sed that uses Regex on very large single cell BAM file How to speed-up sed that uses Regex on very large single cell BAM file unix unix

How to speed-up sed that uses Regex on very large single cell BAM file


Another awk, pretty much like @tripleee's, I'd assume:

$ samtools view -h small.bam | awk 'match($0,/CB:Z:[ACGT]*/) {               # use match for the regex match    a[substr($0,RSTART+5,RLENGTH-5)]++   # len(CB:z:)==5, hence +-5}END {    for(i in a)        print i,a[i]                     # sample output,tweak it to your liking}' 

Sample output:

...TCTTAATCGTCC 175GGGAAGGCCTAA 190TCGGCCGATCGG 32GACTTCCAAGCC 76CCGCGGCATCGG 36TAGCGATCGTGG 125...

Notice: Your sed 's/.*CB:Z:... matches the last instance where as my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.

Notice 2: Quoting @Sundeep in the comments: - - using LC_ALL=C mawk '..' will give even better speed.


With perl

perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'

CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to prevent CB:Z: from being part of matched portion, which is available via $& variable


Sample time with small.bam input file. mawk is fastest for this input, but it might change for larger input file.

# script.awk is the one mentioned in James Brown's answer# result here shown with GNU awk$ time LC_ALL=C awk -f script.awk small.bam > f1real    0m0.092s# mawk is faster compared to GNU awk for this use case$ time LC_ALL=C mawk -f script.awk small.bam > f2real    0m0.054s$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3real    0m0.064s$ diff -sq <(sort f1) <(sort f2)Files /dev/fd/63 and /dev/fd/62 are identical$ diff -sq <(sort f1) <(sort f3)Files /dev/fd/63 and /dev/fd/62 are identical


Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need just using and the pysam library:

import pysamfrom collections import defaultdictcounts = defaultdict(int)tag = 'CB'with pysam.AlignmentFile('small.bam') as sam:    for aln in sam:        if aln.has_tag(tag):            counts[ aln.get_tag(tag) ] += 1for k, v in counts.items():    print(k, v)