
How to compute multiple sequence alignment for text strings


  • The easiest way to align multiple sequences is to do a number of pairwise alignments.

First, compute a similarity score for every pair of sequences and store the scores. This is the most expensive part of the process. Choose the pair with the best score and align it. Next, pick the unaligned sequence that scored best against any sequence already in the aligned set, and add it to the set using its pairwise alignment with that sequence. Repeat until every sequence has been added.
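The scoring and ordering steps could be sketched roughly as follows. This is only an illustration, assuming difflib's SequenceMatcher.ratio() as the similarity score; the function names pairwise_scores and join_order are made up for this example, and ties are broken arbitrarily by iteration order.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_scores(seqs):
    """Return {(i, j): similarity} for every pair of indices i < j."""
    scores = {}
    for i, j in combinations(range(len(seqs)), 2):
        # autojunk=False disables difflib's heuristic that ignores
        # very frequent characters in long inputs.
        matcher = SequenceMatcher(None, seqs[i], seqs[j], autojunk=False)
        scores[(i, j)] = matcher.ratio()
    return scores


def join_order(seqs):
    """Return [(new_index, partner_index), ...] in the order sequences join.

    The first entry is the best-scoring pair. Each later entry is the
    unaligned sequence with the highest score against any sequence that
    is already in the aligned set, together with that partner.
    Assumes at least two sequences.
    """
    scores = pairwise_scores(seqs)

    def score(i, j):
        return scores[(min(i, j), max(i, j))]

    first, second = max(scores, key=scores.get)
    aligned = [first, second]
    order = [(second, first)]
    remaining = [k for k in range(len(seqs)) if k not in (first, second)]
    while remaining:
        new, partner = max(
            ((r, a) for r in remaining for a in aligned),
            key=lambda pair: score(*pair),
        )
        order.append((new, partner))
        aligned.append(new)
        remaining.remove(new)
    return order
```

All scores are computed once up front, which matches the observation that scoring is the expensive part; choosing the order afterwards only looks values up in the stored table.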

When you add a sequence to the aligned set based on a pairwise alignment, that alignment may insert a gap into the sequence that is already in the set. Whenever it does, insert a gap at the same position in every sequence in the aligned set.
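Here is one way the gap propagation might look in code. It is a sketch under a few assumptions: the gap character "-" does not occur in the input strings, the pairwise alignment is derived from difflib's opcodes, and the names pairwise_align and add_to_alignment are invented for this example.

```python
from difflib import SequenceMatcher

GAP = "-"  # assumption: this character never appears in the input strings


def pairwise_align(a, b):
    """Return a and b padded with GAP to equal length, from difflib opcodes."""
    out_a, out_b = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        width = max(len(seg_a), len(seg_b))
        # 'equal' segments need no padding; 'insert', 'delete' and
        # uneven 'replace' segments pad the shorter side with gaps.
        out_a.append(seg_a.ljust(width, GAP))
        out_b.append(seg_b.ljust(width, GAP))
    return "".join(out_a), "".join(out_b)


def add_to_alignment(aligned, anchor_idx, gapped_anchor, gapped_new):
    """Merge a pairwise alignment (anchor vs. new sequence) into a set.

    aligned: list of equal-length gapped rows; aligned[anchor_idx] is the
    row of the sequence that the new one was aligned against.
    gapped_anchor, gapped_new: the pairwise alignment of the ungapped
    anchor and the new sequence.
    Returns a new list of rows with the new sequence appended.
    """
    anchor_row = aligned[anchor_idx]
    cols = [[] for _ in aligned]
    new_row = []
    p = q = 0  # p walks the anchor's row in the set, q the pairwise alignment
    while p < len(anchor_row) or q < len(gapped_anchor):
        if p < len(anchor_row) and anchor_row[p] == GAP:
            # Column that exists only in the set: the new sequence gets a gap.
            for row, col in zip(aligned, cols):
                col.append(row[p])
            new_row.append(GAP)
            p += 1
        elif gapped_anchor[q] == GAP:
            # The pairwise alignment put a gap in the anchor:
            # insert the same gap in every row of the aligned set.
            for col in cols:
                col.append(GAP)
            new_row.append(gapped_new[q])
            q += 1
        else:
            # Same anchor character on both sides: copy the column as is.
            for row, col in zip(aligned, cols):
                col.append(row[p])
            new_row.append(gapped_new[q])
            p += 1
            q += 1
    return ["".join(col) for col in cols] + ["".join(new_row)]
```

For example, add_to_alignment(["abc", "abd"], 0, *pairwise_align("abc", "abxc")) returns ["ab-c", "ab-d", "abxc"]: the gap that the pairwise alignment placed in "abc" is also inserted into "abd".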

Lafrasu has suggested difflib's SequenceMatcher for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.

In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.


Are you looking for something quick and dirty, as in the following?

from difflib import SequenceMatcher

a = "dsa jld lal"
b = "dsajld kll"
c = "dsc jle kal"
d = "dsd jlekal"
ss = [a, b, c, d]

s = SequenceMatcher()
# Compare every pair of strings exactly once.
for i in range(len(ss)):
    x = ss[i]
    s.set_seq1(x)
    for j in range(i + 1, len(ss)):
        y = ss[j]
        s.set_seq2(y)
        print()
        print(s.ratio())                # similarity score in [0, 1]
        print(s.get_matching_blocks())  # matching (i, j, length) blocks


MAFFT version 7.120+ supports multiple text alignment. The input resembles FASTA format, but with LATIN1 text in place of sequences, and the output is in aligned FASTA format. Once MAFFT is installed, it is easy to run:

mafft --text input_text.fa > output_alignment.fa
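If the strings live in a Python program, the input file could be prepared and the command invoked along these lines. This is a sketch only: the helper names and the ">text1", ">text2" record headers are made up for illustration, mafft is assumed to be on the PATH, each text is assumed to be a single line, and no check is made that the characters are ones the text mode accepts.

```python
import subprocess


def write_text_fasta(texts, path):
    """Write each string as a FASTA-style record: a '>' header, then the text."""
    with open(path, "w", encoding="latin-1") as handle:
        for n, text in enumerate(texts, start=1):
            # Header names are arbitrary labels chosen for this example.
            handle.write(f">text{n}\n{text}\n")


def run_mafft_text(in_path, out_path):
    """Equivalent of: mafft --text in_path > out_path (needs mafft installed)."""
    with open(out_path, "w", encoding="latin-1") as out:
        subprocess.run(["mafft", "--text", in_path], stdout=out, check=True)
```

Calling write_text_fasta(strings, "input_text.fa") followed by run_mafft_text("input_text.fa", "output_alignment.fa") mirrors the command line above.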

Although MAFFT is a mature tool for biological sequence alignment, its text alignment mode is still in development; future plans include support for user-defined scoring matrices. See the MAFFT documentation for further details.