An Approximation Algorithm for Alignment of Multiple Sequences using Motif Discovery
Abstract
Given a set of N sequence, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, that brings out the best commonality of the N sequences. The quality of the alignment is usually measured by penalizing the mis-matches and gaps, and rewarding the matches with appropriate weight functions. However for larger values of N, additional constraints are required to give meaningful alignments. We identify a user-controlled parameter, an alignment number K (2 ≤ K ≤ N): this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible, in the alignment. We identify a natural optimization problem for this approach called the K-MSA problem. We show that the problem is MAX SNP hard. We give a natural extension of this problem that incorporates "biological relevance" by using motifs (common patterns in the sequences) and give an approximation algorithm for this problem in terms of the motifs in the data. MUSCA is an implementation of this approach and our experimental results indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences.