Title :
An efficient data structure for applying multiple seeds in homology search
Author :
Khodabakhshi, Alireza Hadj ; Mirzazadeh, Mehdi ; Gupta, Arvind
Author_Institution :
Simon Fraser Univ., Burnaby
Abstract :
Homology search is a fundamental tool in genomics and proteomics research and considerable attention has been paid to the development of faster homology finding algorithms. The most efficient homology search algorithms known are based on finding short exact or near exact subsequences and then extending these to form long matches. Traditional tools such as FASTA and BLAST use the consecutive seed model, that is, an exact short match is found and then extended. More recently, PatternHunter, which uses the spaced seeds model, has shown considerably better sensitivity and attention has been focused on the design of optimal seeds for specific data sets. In some cases, the use of multiple seeds has been shown to be necessary. Here we focus on the problem of applying multiple seeds to a dataset in an efficient manner. In particular, when multiple seeds are applied, each seed can result in the same subsequence being found necessitating the removal of duplicates. We propose an efficient data structure that, given a set of seeds and two input sequences, reports the set of matching subsequences (that is, reports matching subsequences without duplications). We show that for any number of seeds, our algorithm is asymptotically optimal when the input sequences are longer than a certain threshold.
Keywords :
biology computing; cellular biophysics; data structures; genetics; molecular biophysics; molecular configurations; proteins; data structure; genomics; homology search; matching subsequences; multiple seeds; proteomics; Bioinformatics; Data structures; Databases; Dynamic programming; Genetic mutations; Genomics; In vitro; Pattern matching; Proteomics; Robustness;
Conference_Titel :
Bioinformatics and Bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International Conference on
Conference_Location :
Boston, MA
Print_ISBN :
978-1-4244-1509-0
DOI :
10.1109/BIBE.2007.4375750