DocumentCode :
710112
Title :
Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search
Author :
Jin Wang ; Guoliang Li ; Dong Deng ; Yong Zhang ; Jianhua Feng
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
fYear :
2015
fDate :
13-17 April 2015
Firstpage :
519
Lastpage :
530
Abstract :
String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for either the former or the latter; most of them cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. We then use the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, use these strings to estimate an upper bound for pruning dissimilar strings, and propose an algorithm (HS-Topk). We also develop effective pruning techniques to further improve performance. Experimental results on real-world datasets show that our method achieves high performance on both problems and significantly outperforms state-of-the-art algorithms.
Keywords :
data integration; string matching; tree searching; HS-tree; data cleaning; hierarchical segment tree index; pruning techniques; threshold-based string similarity search; top-k string similarity search; Blogs; Heuristic algorithms; Indexes; Partitioning algorithms; Search problems; Silicon; Upper bound
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering (ICDE), 2015 IEEE 31st International Conference on
Conference_Location :
Seoul
Type :
conf
DOI :
10.1109/ICDE.2015.7113311
Filename :
7113311