DocumentCode
14014
Title
Efficiently Supporting Edit Distance Based String Similarity Search Using B
-Trees
Author
Wei Lu ; Xiaoyong Du ; Hadjieleftheriou, MariosMarios ; Beng Chin Ooi
Author_Institution
Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore
Volume
26
Issue
12
fYear
2014
fDate
Dec. 2014
Firstpage
2983
Lastpage
2996
Abstract
Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space in that edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
Keywords
computational complexity; formal languages; query processing; relational databases; trees (mathematics); B+-trees; KNN queries; NP-hard problem; RDBMSs; data set optimal partitioning; edit distance based string similarity queries; edit distance based string similarity search; filter-and-verify framework; heuristic approach; metric space; optimal partition assignment strategy; pruning techniques; query evaluation; range queries; reference strings; Educational institutions; Electronic mail; Indexing; NP-hard problem; Partitioning algorithms; B$^+$ -tree; Similarity search; edit distance; string;
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
ieee
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2014.2309131
Filename
6750698
Link To Document