DocumentCode :
9162
Title :
VChunkJoin: An Efficient Algorithm for Edit Similarity Joins
Author :
Wei Wang ; Jianbin Qin ; Chuan Xiao ; Xuemin Lin ; Shen, Heng Tao
Author_Institution :
Sch. of Comput. Sci. & Eng., Univ. of New South Wales (UNSW), Sydney, NSW, Australia
Volume :
25
Issue :
8
fYear :
2013
fDate :
Aug. 2013
Firstpage :
1916
Lastpage :
1929
Abstract :
Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
Keywords :
constraint handling; data handling; data integration; pattern recognition; VChunkJoin; data cleaning; data integration; edit distance constraint; edit similarity joins; filtering methods; nonoverlapping substrings; overlapping grams; pattern recognition; record linkage; tail-restricted chunk boundary dictionary; Algorithm design and analysis; Approximation algorithms; Arrays; Dictionaries; Filtering; Indexes; Robustness; Edit similarity joins; approximate string matching; edit distance; prefix filtering; q-gram;
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
ieee
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2012.79
Filename :
6185552
Link To Document :
بازگشت