DocumentCode
3292419
Title
Approximate Similarity Search in Genomic Sequence Databases Using Landmark-Guided Embedding
Author
Sacan, Ahmet ; Toroslu, I. Hakki
Author_Institution
Ohio State Univ., Columbus
fYear
2008
fDate
11-12 April 2008
Firstpage
43
Lastpage
50
Abstract
Similarity search in sequence databases is of paramount importance in bioinformatics research. As the size of genomic databases increases, similarity search of proteins in these databases becomes a bottleneck in large-scale studies, calling for more efficient methods of content-based retrieval. In this study, we present a metric-preserving, landmark-guided embedding approach to represent sequences in the vector domain in order to allow efficient indexing and similarity search. We analyze various properties of the embedding and show that the approximation provided by the embedded representation is sufficient to achieve biologically relevant results. The approximate representation is shown to provide several orders of magnitude speed-up in similarity search compared to the exact representation, while maintaining comparable search accuracy.
Keywords
biology computing; content-based retrieval; database management systems; proteins; bioinformatics research; genomic sequence databases; landmark-guided embedding; Application software; Bioinformatics; Data engineering; Databases; Genomics; Indexing; Large-scale systems; Matrices; Proteins; Sequences; approximate similarity search; database; indexing; metric space; multi-dimensional scaling; sequences
fLanguage
English
Publisher
IEEE
Conference_Titel
Similarity Search and Applications, 2008. SISAP 2008. First International Workshop on
Conference_Location
Belfast
Print_ISBN
0-7695-3101-6
Type
conf
DOI
10.1109/SISAP.2008.7
Filename
4492924