How to compute multiple sequence alignment for text strings

The easiest way to align multiple sequences is to do a number of pairwise alignments.

First get pairwise similarity scores for each pair and store those scores. This is the most expensive part of the process. Choose the pair that has the best similarity score and do that alignment. Now pick the sequence which aligned best to one of the sequences in the set of aligned sequences, and align it to the aligned set, based on that pairwise alignment. Repeat until all sequences are in.

When you are aligning a sequence to the aligned sequences, (based on a pairwise alignment), when you insert a gap in the sequence that is already in the set, you insert gaps in the same place in all sequences in the aligned set.

Lafrasu has suggested the SequneceMatcher() algorithm to use for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.

In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.

python string text alignment sequence

Are you looking for something quick and dirty, as in the following?

from difflib import SequenceMatchera = "dsa jld lal"b = "dsajld kll"c = "dsc jle kal"d = "dsd jlekal"ss = [a,b,c,d]s = SequenceMatcher()for i in range(len(ss)):    x = ss[i]    s.set_seq1(x)    for j in range(i+1,len(ss)):        y = ss[j]        s.set_seq2(y)        print        print s.ratio()        print s.get_matching_blocks()

python string text alignment sequence

MAFFT version 7.120+ supports multiple text alignment. Input is like FASTA format but with LATIN1 text instead of sequences and output is aligned FASTA format. Once installed, it is easy to run:

mafft --text input_text.fa > output_alignment.fa

Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is in the development stage, with future plans including permitting user defined scoring matrices. You can see the further details in the documentation.

CodeHunter

How to compute multiple sequence alignment for text strings

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last