DocumentCode :
3322088
Title :
Effective indexing and filtering for similarity search in large biosequence databases
Author :
Ozturk, Ozgur ; Ferhatosmanoglu, Hakan
Author_Institution :
Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
fYear :
2003
fDate :
10-12 March 2003
Firstpage :
359
Lastpage :
366
Abstract :
We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compare their (a) approximation quality for k-Nearest Neighbor (k-NN) queries, (b) pruning ability, and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e., the Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of the transformed vectors and distance functions. Promising results from experiments on real biosequence data sets are presented.
Keywords :
DNA; biology computing; proteins; trees (mathematics); vectors; DNA databases; E-range queries; R-trees; approximation quality; distance functions; effective index structures; frequency functions; k-nearest neighbor queries; large biosequence databases; protein databases; pruning ability; real biosequence data sets; scalar quantization; similarity search; wavelet distance functions; Bioinformatics; Biomedical computing; DNA computing; Databases; Filtering; Genomics; Indexing; Information science; Proteins; Sequences;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Bioinformatics and Bioengineering, 2003. Proceedings. Third IEEE Symposium on
Print_ISBN :
0-7695-1907-5
Type :
conf
DOI :
10.1109/BIBE.2003.1188974
Filename :
1188974