Title :
Fast retrieval of similar subsequences in long sequence databases
Author :
Park, Sanghyun ; Lee, Dongwon ; Chu, Wesley W.
Author_Institution :
Dept. of Comput. Sci., California Univ., Los Angeles, CA, USA
Abstract :
Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer to use high-cost distance functions such as the time warping distance and the editing distance for wider applicability. However, if these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic to the average length L¯ of data sequences. We propose a novel subsequence matching scheme, called the aligned subsequence matching, where the number of subsequences to be compared with a query sequence is reduced to linear to L¯. We also present an indexing technique to speed-up the aligned subsequence matching using the similarity measure of the modified time warping distance. Experiments on synthetic data sequences demonstrate the effectiveness of our proposed approach; ours consistently outperformed sequential scanning and achieved an up to 6.5 times speed-up
Keywords :
database theory; query processing; string matching; Euclidean distance; aligned subsequence matching; data mining; editing distance; high-cost distance functions; indexing technique; information retrieval; long sequence databases; modified time warping distance; query sequence; similar subsequences; similarity measure; subsequence matching scheme; synthetic data sequences; time warping distance; Computer science; Data mining; Databases; Ear; Euclidean distance; Indexing; Information retrieval; Sampling methods; Time measurement; Velocity measurement;
Conference_Titel :
Knowledge and Data Engineering Exchange, 1999. (KDEX '99) Proceedings. 1999 Workshop on
Conference_Location :
Chicago, IL
Print_ISBN :
0-7695-0453-1
DOI :
10.1109/KDEX.1999.836610