Problems using awk to select group of sequences from fasta file Problems using awk to select group of sequences from fasta file unix unix

Problems using awk to select group of sequences from fasta file


I would advise you to use an available fasta format parser.

For instance, in python3, you can use the pretty efficient parser from pyGATB.

You could use it as follows:

#!/usr/bin/env python3import sysfrom gatb import Bankfasta_file = sys.argv[1]pop_name = sys.argv[2]def get_pop(header):    """Extracts the population name from the fasta header."""    return header.decode("utf-8").split(" ")[1].split("_")[0][1:]for seq in Bank(fasta_file):    if get_pop(seq.comment) == pop_name:        print(">%s\n%s" % (            seq.comment.decode("utf-8"),            seq.sequence.decode("utf-8")))sys.exit(0)

Running this with your example file (which strangely has an empty sequence for CLocus_12706_Sample_46_Locus_34641_Allele_0 [JoJo_s115.fq; groupI, 125578, +]):

./extract_pop.py test.fa "JoJo">CLocus_12706_Sample_44_Locus_36326_Allele_0 [JoJo_s113.fq; groupI, 125578, +]TGCAGCATGCTGGTGAACGCGTCATCATAAGCCTGTTGGCGAGCCAGCAGAAGGCGGCATGGGCAGCACTTAATAGGACGCACGTCCTCTGTGTCA>CLocus_12706_Sample_46_Locus_34641_Allele_0 [JoJo_s115.fq; groupI, 125578, +]

If you don't have python3, you can use Biopython's SeqIO module:

#!/usr/bin/env pythonimport sysfrom Bio import SeqIOfasta_file = sys.argv[1]pop_name = sys.argv[2]def get_pop(header):    """Extracts the population name from the fasta header."""    return header.split(" ")[1].split("_")[0][1:]for seq in SeqIO.parse(fasta_file, format="fasta"):    if get_pop(seq.description) == pop_name:        print(">%s\n%s" % (seq.description, seq.seq))sys.exit(0)