Merge two large text files by common row to one mapping file

unix


In my opinion, the easiest way would be to use BLAST+...

Set up the larger file as a BLAST database and use the smaller file as the query...

Then just write a small script to analyse the output, i.e. take the top hit or two, to create the mapping file.

BTW, you might find SequenceServer (Google it) helpful for setting up a custom BLAST database and your BLAST environment...


Biopython should be able to read in large FASTA files.

from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)
    # Note: this re-parses libs.fasta once per stool record.
    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)
        # Record a match when the stool sequence starts with the library sequence.
        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)
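Since the inner loop re-reads the library file for every stool record, it can be slow on large inputs. If the library file fits in memory, you can parse it once and run the prefix match on plain (id, sequence) pairs. A sketch of that core logic (the function name is mine, not from the answer above; feed it records from `SeqIO.parse` in practice):

```python
from collections import defaultdict

def map_by_prefix(lib_records, stool_records):
    """Map each library id (before any ';') to the ids of stool records
    whose sequence starts with that library sequence.
    Both arguments are iterables of (id, sequence) string pairs."""
    # Materialise the (smaller) library once so it can be scanned repeatedly.
    lib_pairs = [(lib_id.split(';')[0], seq) for lib_id, seq in lib_records]
    mapping = defaultdict(list)
    for stool_id, stool_seq in stool_records:
        for lib_id, lib_seq in lib_pairs:
            if stool_seq.startswith(lib_seq):
                mapping[lib_id].append(stool_id)
    return mapping

# With Biopython, you would call it roughly like this (filenames assumed):
# from Bio import SeqIO
# mapping = map_by_prefix(
#     ((r.id, str(r.seq)) for r in SeqIO.parse('libs.fasta', 'fasta')),
#     ((r.id, str(r.seq)) for r in SeqIO.parse('stool.fasta', 'fasta')),
# )
```

Keeping the matching separate from the file parsing also makes it easy to test on small hand-made inputs before running it on the full files.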