QOMA2: Optimizing the alignment of many sequences

Author

Zhang, Xu ; Kahveci, Tamer

Author_Institution

Univ. of Florida, Gainesville

fYear

2007

fDate

14-17 Oct. 2007

Firstpage

780

Lastpage

787

Abstract

We consider the problem of aligning multiple protein sequences with the goal of maximizing the SP (sum-of-pairs) score, when the number of sequences is large. The QOMA (quasi-optimal multiple alignment) algorithm addressed this problem when the number of sequences is small. However, as the number of sequences increases, QOMA becomes impractical. This paper develops a new algorithm, QOMA2, which optimizes the SP score of the alignment of arbitrarily large number of sequences. Given an initial (potentially sub-optimal) alignment , QOMA2 selects short subsequences from this alignment by placing a window on it. It quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in short windows on this alignment. This estimate is called the SW (sum of weights) score. It employs a dynamic programming algorithm that selects the set of window positions with the largest total expected improvement. It partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. Also, it aims to select these clusters so that the optimal alignment of the subsequences in these clusters produces the highest expected SP score. The experimental results show that QOMA2 produces high SP scores quickly even for large number of sequences. They also show that the SW score and the resulting SP score are highly correlated. This implies that it is promising to aim for optimizing the SW score since it is much cheaper than aligning multiple sequences optimally. The software and the benchmark data set are available from the authors on request.

Keywords

biology computing; dynamic programming; molecular biophysics; molecular configurations; proteins; QOMA2; dynamic programming; optimization; protein sequences; quasi-optimal multiple alignment; sequence alignment; Clustering algorithms; Computational biology; Dynamic programming; Heuristic algorithms; Information science; NP-complete problem; Partitioning algorithms; Phylogeny; Protein engineering; Sequences;

fLanguage

English

Publisher

ieee

Conference_Titel

Bioinformatics and Bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International Conference on

Conference_Location

Boston, MA

Print_ISBN

978-1-4244-1509-0

Type

conf

DOI

10.1109/BIBE.2007.4375649

Filename

4375649