How to speed-up sed that uses Regex on very large single cell BAM file

Another awk, pretty much like @tripleee's, I'd assume:

$ samtools view -h small.bam | awk 'match($0,/CB:Z:[ACGT]*/) {               # use match for the regex match    a[substr($0,RSTART+5,RLENGTH-5)]++   # len(CB:z:)==5, hence +-5}END {    for(i in a)        print i,a[i]                     # sample output,tweak it to your liking}'

Sample output:

...TCTTAATCGTCC 175GGGAAGGCCTAA 190TCGGCCGATCGG 32GACTTCCAAGCC 76CCGCGGCATCGG 36TAGCGATCGTGG 125...

Notice: Your sed 's/.*CB:Z:... matches the last instance where as my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.

Notice 2: Quoting @Sundeep in the comments: - - using LC_ALL=C mawk '..' will give even better speed.

linux unix awk sed bioinformatics

With perl

perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'

CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to prevent CB:Z: from being part of matched portion, which is available via $& variable

Sample time with small.bam input file. mawk is fastest for this input, but it might change for larger input file.

# script.awk is the one mentioned in James Brown's answer# result here shown with GNU awk$ time LC_ALL=C awk -f script.awk small.bam > f1real    0m0.092s# mawk is faster compared to GNU awk for this use case$ time LC_ALL=C mawk -f script.awk small.bam > f2real    0m0.054s$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3real    0m0.064s$ diff -sq <(sort f1) <(sort f2)Files /dev/fd/63 and /dev/fd/62 are identical$ diff -sq <(sort f1) <(sort f3)Files /dev/fd/63 and /dev/fd/62 are identical

linux unix awk sed bioinformatics

Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need just using python and the pysam library:

import pysamfrom collections import defaultdictcounts = defaultdict(int)tag = 'CB'with pysam.AlignmentFile('small.bam') as sam:    for aln in sam:        if aln.has_tag(tag):            counts[ aln.get_tag(tag) ] += 1for k, v in counts.items():    print(k, v)

CodeHunter

How to speed-up sed that uses Regex on very large single cell BAM file

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last