Problems using awk to select group of sequences from fasta file
I would advise you to use an available fasta format parser.
For instance, in python3, you can use the pretty efficient parser from pyGATB.
You could use it as follows:
#!/usr/bin/env python3import sysfrom gatb import Bankfasta_file = sys.argv[1]pop_name = sys.argv[2]def get_pop(header): """Extracts the population name from the fasta header.""" return header.decode("utf-8").split(" ")[1].split("_")[0][1:]for seq in Bank(fasta_file): if get_pop(seq.comment) == pop_name: print(">%s\n%s" % ( seq.comment.decode("utf-8"), seq.sequence.decode("utf-8")))sys.exit(0)
Running this with your example file (which strangely has an empty sequence for CLocus_12706_Sample_46_Locus_34641_Allele_0 [JoJo_s115.fq; groupI, 125578, +]
):
./extract_pop.py test.fa "JoJo">CLocus_12706_Sample_44_Locus_36326_Allele_0 [JoJo_s113.fq; groupI, 125578, +]TGCAGCATGCTGGTGAACGCGTCATCATAAGCCTGTTGGCGAGCCAGCAGAAGGCGGCATGGGCAGCACTTAATAGGACGCACGTCCTCTGTGTCA>CLocus_12706_Sample_46_Locus_34641_Allele_0 [JoJo_s115.fq; groupI, 125578, +]
If you don't have python3, you can use Biopython's SeqIO
module:
#!/usr/bin/env pythonimport sysfrom Bio import SeqIOfasta_file = sys.argv[1]pop_name = sys.argv[2]def get_pop(header): """Extracts the population name from the fasta header.""" return header.split(" ")[1].split("_")[0][1:]for seq in SeqIO.parse(fasta_file, format="fasta"): if get_pop(seq.description) == pop_name: print(">%s\n%s" % (seq.description, seq.seq))sys.exit(0)